Abstract

We study efficient deterministic parallel algorithms on two models: restartable fail-stop
CRCW PRAMs and strongly asynchronous PRAMs. In the first model, synchronous processors
are subject to arbitrary stop failures and restarts determined by an on-line adversary and
involving loss of private but not shared memory; the complexity measures are completed work
(where processors are charged for completed fixed-size update cycles) and overhead ratio
(completed work amortized over necessary work and failures). In the second model, the result
of the computation is a serialization of the actions of the processors determined by an
on-line adversary; the complexity measure is total work (number of steps taken by all
processors). Despite their differences, the two models share key algorithmic techniques.

We present new algorithms for the Write-All problem (in which P processors write ones into
an array of size N) for the two models. These algorithms can be used to implement a simulation
strategy for any N processor PRAM on a restartable fail-stop P processor CRCW PRAM such
that it guarantees a terminating execution of each simulated N processor step, with
$O(\log^2 N)$ overhead ratio, and $O(\min\{N + P\log^2 N + M\log N,\ N \cdot P^{\log\frac{3}{2}}\})$
(sub-quadratic) completed work (where M is the number of failures during this step's
simulation). We also show that Write-All requires $N - P + \Omega(P\log P)$ completed/total
work on these models for $P \le N$.
1  Introduction

1.1  Context  of  this  work 

The model of parallel computation known as the Parallel Random Access Machine or PRAM
[FW 78] has attracted much attention in recent years. Many efficient and optimal algorithms
have been designed for it; see the surveys [EG 88, KR 90]. The PRAM is a convenient
abstraction that combines the power of parallelism with the simplicity of a RAM, but it has
several unrealistic features. The PRAM requires: (1) simultaneous access (requiring
significant bandwidth) to a shared resource, namely memory; (2) global processor
synchronization; and (3) perfectly reliable processors, memory and interconnection between
them. The gap between the abstract models of parallel computation and realizable parallel
computers is being bridged by current research. For example, memory access simulation in
other architectures is the subject of a large body of literature surveyed in [Val 90a]; for
some recent work see [HP 89, Ran 87, Upf 89]. Asynchronous PRAMs are the subject of
[CZ 89, CZ 90, Gib 89, MSP 90, Nis 90]. Here we address the issues of synchronization and
reliability of PRAM processors.

In [KS 89] it is shown that it is possible to combine efficiency and fault-tolerance in many
key PRAM algorithms in the presence of arbitrary dynamic fail-stop processor errors (when
processors fail by stopping and do not perform any further actions). The key to such
algorithm design is the following fundamental problem, called the Write-All problem:

Given a P-processor PRAM and a 0-valued array
of N elements, write value 1 into all array locations.


This problem was formulated to capture the essence of the computational progress that can be
naturally accomplished in unit time by a PRAM (when P = N). In the absence of failures, this
problem is solved by a trivial and optimal parallel assignment. However, it is not obvious how
to design solutions that are efficient in the presence of failures or asynchrony. [KS 89] give
an algorithm for the Write-All problem that does a total of $O(N\log^2 N)$ work.
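The failure-free case can be made concrete with a small sequential simulation. This is only a sketch: the function name `write_all_failure_free` and the assignment of cell `base + pid` to processor `pid` are assumptions chosen for illustration, not taken from the paper.

```python
def write_all_failure_free(n: int, p: int):
    """Simulate a synchronous, failure-free P-processor Write-All on N cells.

    Returns the final array and the number of parallel steps used.
    """
    x = [0] * n
    steps = 0
    base = 0
    while base < n:
        # One synchronous parallel step: processor pid writes cell base + pid.
        for pid in range(p):
            if base + pid < n:
                x[base + pid] = 1
        base += p
        steps += 1
    return x, steps
```

With P = N this finishes in a single parallel step, which is the unit-time progress the problem is meant to capture. The sketch has no defence against a processor that stops: the cells assigned to it would simply stay 0.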

The iterated Write-All paradigm is employed (independently) in [KPS 90] and [Shv 89] to
extend the results of [KS 89] to arbitrary PRAM algorithms (subject to fail-stop errors without
restarts). In addition to the general simulation technique, [KPS 90] analyzes the expected
behavior of several solutions to Write-All using a particular random failure model. [Shv 89]
presents a deterministic optimal work execution of PRAM algorithms subject to worst case
failures given parallel slackness (as in [Val 90b]).

A simple randomized algorithm that serves as a basis for simulating arbitrary PRAM algorithms
on an asynchronous PRAM is presented in [MSP 90]. This randomized asynchronous simulation has
very good expected performance for the Write-All problem when the adversary is off-line.
Recently, [KPRS 90] further refined the results of [KPS 90] to produce an approach that leads
to constant expected slowdown of PRAM algorithms when the power of the adversary is
restricted. [KPRS 90] also improved the fail-stop deterministic lower and upper bounds of
[KS 89] (by log log N factors).

The work presented here deals with dynamic patterns of faults and the dynamic assignment of
processors to tasks. Processors in our algorithms have very little private information and
communicate via shared memory. For recent advances on coping with static fault patterns, see
[K* 90]. We consider fault granularity at the processor level; for recent work on gate
granularities, see [AU 90, Pip 85, Rud 85]. The general problem of assigning active processors
to tasks has similarities to the problems of resource management in a distributed setting,
such as the distributed controllers of [LGFG 86] and [AAPS 87]. Fault-tolerance of particular
network architectures is also studied in [DPPU 86]. However, the distributed computation
models, the algorithms, and their analysis are quite different from the parallel setting
studied here.

1.2  Contributions  of  this  paper 

In this paper, we extend the fail-stop model of [KS 89] by allowing arbitrary dynamic restarts
of processors (with loss of private memory). We also consider a model in which private memory
is safe, but the interactions of processors with each other through shared memory can no
longer be assumed to be synchronous. Although the models differ in their formal definition,
some algorithms work equally well in both models.

In the restartable fail-stop model, defined precisely in Section 2.1, PRAM processors are
subject to on-line (dynamic) failures and restarts. We concentrate on the worst case analysis
of the completed work of deterministic algorithms that are subject to arbitrary adversaries,
and on the overhead ratio, which amortizes the work over the necessary work and
failures/restarts. In this model, processors fail and then restart in a way that makes it
possible to develop terminating algorithms, while relaxing the requirement that one processor
must never fail (which was necessary in the fail-stop without restart model). To guarantee
algorithm termination and sensible accounting of resources, we introduce an update cycle that
generalizes the standard PRAM read/compute/write cycle. In the absence of update cycles, a
thrashing adversary exploiting the separation of read and write instructions in PRAMs can
force quadratic work for any Write-All solution. The restartable PRAM model is defined in
Section 2.1, which also contains a discussion of the technical choices made.

The strongly asynchronous model is defined in Section 2.2. In this model, we use Lamport's
notion of serializability [Lam 86], which states that the effect of a parallel computation
should be consistent with some serialization of atomic processor actions. We consider the
serialization to be chosen by an on-line adversary, and use atomic reads and atomic writes
(other primitives are considered as well). This model is related to other models known as
asynchronous PRAMs [CZ 89, CZ 90, Gib 89, MSP 90, Nis 90]; perhaps the closest of these is
[MSP 90], although this reference considers only off-line (pre-specified) adversaries and
randomized algorithms. The relationship of the two models in Sections 2.1 and 2.2 is discussed
in Section 2.3; some practical motivation is also discussed in Section 1.3 below.

In Section 3, we present lower bounds for the Write-All problem. When reads and writes
are accounted together in update cycles of the restartable fail-stop model, the quadratic
lower bound mentioned above no longer applies. Instead, we show that the Write-All problem of
size N requires $N - P + \Omega(P\log P)$ completed work for $P \le N$. This bound also holds
in the strongly asynchronous model. It holds even when processors can read and locally process
the entire shared memory at unit cost. Under these assumptions it is the tightest possible
bound. An $\Omega(N\log N)$ lower bound when P = N was recently shown in [KPRS 90] using a
different technique and different assumptions for a fail-stop no-restart model. Our lower
bound results are of interest because: (a) they demonstrate that any improvement to the lower
bound must take account of the fact that processors can read only a constant number of cells
in constant time, and (b) they present a simple processor allocation strategy that we use to
advantage in Section 4. We also demonstrate a lower bound of $N + \Omega(P\log N)$ (when
$3 \le P \le N$) for the strongly asynchronous PRAM, when certain atomic primitives (such as
compare-and-swap or test-and-set) are used to access shared memory.

In Section 4 we present three efficient algorithms for the Write-All problem. The first
(algorithm V) is a modification of the algorithm developed in [KS 89] for the fail-stop
no-restart model, and runs on the restartable fail-stop model with completed work
$O(N + P\log^2 N + M\log N)$, where M is the number of failures. This algorithm is based on an
analysis of the lower bounds in Section 3. The second (algorithm X) runs on both models in
time $O(N \cdot P^{\log\frac{3}{2}})$. The third (algorithm T) runs on both models in the case
P = 3, using $N + O(\log N)$ compare-and-swap operations on the strongly asynchronous model
and $N + O(\log N)$ update cycles in the fail-stop restart model. This matches the lower bound
when three processors are used.

In Section 5, we show how to use algorithms V and X to simulate any N processor PRAM
on a restartable fail-stop P processor CRCW PRAM. A terminating execution of each simulated
N processor step is guaranteed with $O(\log^2 N)$ overhead ratio, and (sub-quadratic)
completed work $O(\min\{N + P\log^2 N + M\log N,\ N \cdot P^{\log\frac{3}{2}}\})$, where M is
the number of failures during the simulation of the particular step. The strategy is
work-optimal when the number of simulating processors is $P \le N/\log^2 N$ and the total
number of failures per each simulated step is $O(N/\log N)$.
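The work-optimality condition can be checked by direct substitution. The calculation below reads the first term of the completed-work bound as $N + P\log^2 N + M\log N$ and writes the failure bound as $M \le \kappa N/\log N$; the constant $\kappa$ is introduced here only for the calculation.

```latex
\[
P \le \frac{N}{\log^2 N} \;\Rightarrow\; P\log^2 N \le N,
\qquad
M \le \frac{\kappa N}{\log N} \;\Rightarrow\; M\log N \le \kappa N,
\]
\[
N + P\log^2 N + M\log N \;\le\; (2 + \kappa)\,N \;=\; O(N).
\]
```

Under these two conditions each simulated N processor step therefore costs $O(N)$ completed work, which is within a constant factor of the N operations the step itself contains.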

The lower bounds presented in Section 3 apply to the worst-case work of deterministic
algorithms and to the expected work of randomized and deterministic algorithms. Randomization
does not seem to help, given on-line (non-prespecified) patterns of failures. For example, it
is easy to construct on-line failure and restart (resp. no-restart) patterns that lead to
exponential (resp. quadratic) in N expected performance for the algorithms presented in
[MSP 90]. These stalking adversaries are described in Section 6, where we also conclude with
some open problems.

Preliminary versions of this work were reported in [BR 90, KS 91].


1.3  Motivation  and  relation  to  physical  systems 

The  models  we  present  and  study  are  intended  to  capture  certain  features  of  actual  systems. 

Processor delay is a feature of any multi-user environment, in which processing priorities are
not specified by a single user. Processing time may be required at a moment's notice by
another user or by the underlying operating system.

Processor failure may occur either because of a physical fault or because another entity in
the system preempts processing time without saving the old state.

Communication delay is a well-known feature of multi-processor systems. Small communication
delays are compatible with synchronization if the step time is sufficient for the longest
possible access time, but synchronizing by counting up to the longest possible access time
eliminates any advantages due to caching and similar techniques.

Communication failure may be due to memory operations of other processors. The interacting
operations need not involve the same memory module. If the communication network reports the
failure of an operation, the processor can re-attempt the access, and the situation can be
modelled as a communication delay. If unannounced failures can occur, an algorithm must either
explicitly check its write operations or ensure in some other way that omission of a write is
not detrimental to performance.

Figure 1: A robust fail-stop multiprocessor.

In this paper, we treat delay and/or failure as occurring to the processors only. If memory
operations are atomic and serializable, they may be assumed to be instantaneous, and the
communication delays or access failures may be charged to the processor.

An architecture for a restartable fail-stop multiprocessor: The main goal of this work is to
study algorithmic techniques that enable efficient parallel computation on realizable
multiprocessor systems. We now suggest one way of realizing the abstract model of computation
where processors are subject to fail-stop errors and restarts, i.e., the model formalized in
Section 2.1.

Engineering and technological approaches exist that allow implementing electronic components
and systems that operate correctly when subjected to certain failures (for examples, see
[IEEE 90, Cri 91]). The technologies cited in the next paragraph are instrumental in providing
basic hardware fault-tolerance for a foundation on which the algorithmic and software
fault-tolerance can be built.

Semiconductor memories are the essential components of shared memory parallel systems.
Memories are routinely manufactured with built-in fault tolerance using replication and coding
techniques without appreciably degrading performance (see the survey [SM 84]). Interconnection
networks are typically used in a multiprocessor system to provide communication among
processors, memory modules and other devices (e.g., as in the Ultracomputer [Sch 80]). The
fault-tolerance of interconnection networks has been the subject of much work in its own
right. The networks are made more reliable by employing redundancy (see the survey [AAS 87]).
A combining interconnection network that is perfectly suited for implementing synchronous
concurrent reads and writes is formally treated in [KRS 88]. Finally, fail-stop processors are
formally presented and justified in [SS 83].

The abstract model that we are studying can be realized (Figure 1) in the following
architecture, using the components we have just discussed:

1. There are P fail-stop processors, each with a unique address and some amount of local
memory. Processors are unreliable.

2. There are Q addressable shared memory cells. The input of size N < Q is stored in shared
memory. This memory is assumed to be reliable.

3. Interconnection of processors and memory is provided by a synchronous combining
interconnection network. This network is assumed to be reliable.

With this architecture, our algorithmic techniques become completely applicable; i.e., the
algorithms and simulations we develop will work correctly, and within the complexity bounds
(under the unit cost memory access assumption), for all patterns of processor failures and
restarts when the underlying components are subject to failures within their respective design
parameters.


2  Models  of  computation 

2.1  The  restartable  fail-stop  CRCW  PRAM 

We use as a basis the PRAM model [FW 78], where all concurrently writing processors write the
same value (COMMON CRCW). Processors are subject to stop failures and restarts as in [SS 83].
Our algorithms are described using the forall/parbegin/parend parallel construct.

1. There are P synchronous processors. Each processor has a unique permanent identifier (PID)
in the range 0, ..., P − 1, and each processor has access to P and its own PID.

2. The global memory accessible to all processors is denoted as shared; in addition, each
processor has a constant size local memory denoted as private. All memory cells are capable of
storing $O(\log\max\{N, P\})$ bits on inputs of size N.

3. The input is stored in N cells in shared memory, and the rest of the shared memory is cleared
(i.e., contains zeroes). The processors have access to the input and its size N.

In all our algorithms:

• The PRAM processors execute sequences of instructions grouped in update cycles. Each
update cycle consists of reading a small fixed number of shared memory cells (e.g., 4),
performing some fixed time computation, and writing a small number of shared memory cells
(e.g., 2).

The parameters of the update cycle, i.e., the number of read and write instructions, are
fixed, but depend on the instruction set of the PRAM; see [FW 78] for a typical PRAM
instruction set. The values quoted (4 and 2) are sufficient for our exposition. It is an
interesting question whether smaller values would suffice to implement efficient algorithms.
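The shape of an update cycle can be sketched as follows. This is a minimal illustration, not the paper's formalism: the names `run_update_cycle`, `MAX_READS`, `MAX_WRITES` and the `fail_before_write` flag are hypothetical, and a failure is modelled only at the single point between the compute phase and the write phase.

```python
MAX_READS = 4   # example value quoted in the text
MAX_WRITES = 2  # example value quoted in the text


def run_update_cycle(shared, read_addrs, compute, fail_before_write=False):
    """Run one update cycle on the shared memory list `shared`.

    `compute` maps the list of values read to a list of (address, value)
    pairs to write. Returns True if the cycle completed (and would be
    charged as completed work), False if the processor failed first.
    """
    if len(read_addrs) > MAX_READS:
        raise ValueError("too many reads for one update cycle")
    values = [shared[a] for a in read_addrs]   # read phase
    writes = compute(values)                   # fixed-time compute phase
    if len(writes) > MAX_WRITES:
        raise ValueError("too many writes for one update cycle")
    if fail_before_write:
        # Simplification: the failure hits before any write is issued.
        # Shared memory is untouched and the cycle is not charged.
        return False
    for addr, value in writes:                 # write phase
        shared[addr] = value
    return True
```

A completed cycle returns `True` and leaves its writes in shared memory; an interrupted one returns `False` and, in this simplified sketch, leaves shared memory unchanged.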

We use the fail-stop with restart failure model, where time instances are the PRAM synchronous
clock-ticks:

1. A failure pattern F (i.e., failures and restarts) is determined by an on-line adversary,
that knows everything about the algorithm and is unknown to the algorithm.

2. Any processor may fail at any time during any update cycle, or having failed it may restart
at any time, provided that:

(i) at any time at least one processor is executing an update cycle that successfully completes;

(ii) single bit writes are atomic, i.e., failures can occur before or after a write of a single bit.

3. Failures do not affect the shared memory, but the failed processors lose their private memory.
Processors are restarted at their initial state with their PID as their only knowledge.
The failure and restart patterns are syntactically defined as follows:

Definition 2.1 A failure pattern F is a set of triples <tag, PID, t> where tag is either
failure, indicating a processor failure, or restart, indicating a processor restart, PID is the
processor identifier, and t is the time indicating when the processor stops or restarts. The
size of the failure pattern F is defined as the cardinality |F|. □
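One possible encoding of Definition 2.1 as a data structure is sketched below. The class and function names (`Tag`, `Event`, `pattern_size`) are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    FAILURE = "failure"
    RESTART = "restart"


@dataclass(frozen=True)
class Event:
    """One triple <tag, PID, t> of a failure pattern."""
    tag: Tag
    pid: int
    t: int


def pattern_size(pattern: frozenset[Event]) -> int:
    """Size |F| of a failure pattern: the cardinality of the set of triples."""
    return len(pattern)
```

Because the pattern is a set, a repeated triple counts once, so |F| counts distinct failure and restart events.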


For simplicity of presentation, we assume that the shared memory writes of
$O(\log\max\{N, P\})$ bit words are atomic. Algorithms using this assumption can be easily
converted to use only single bit atomic writes as in [KS 89].

We investigate two natural complexity measures, completed work and overhead ratio. The
completed work measure generalizes the standard Parallel-time × Processors product and the
available processor steps of [KS 89]. The overhead ratio is an amortized measure.

Definition 2.2 Consider an algorithm with P initial processors that terminates in
parallel-time $\tau$ after completing its task on some input data I and in the presence of a
failure pattern F. If $P_i(I,F) \le P$ is the number of processors completing an update cycle
at time i, and c is the time required to complete one update cycle, then we define S(I, F, P) as:

$$S(I,F,P) = c \sum_{i=1}^{\tau} P_i(I,F).$$ □
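The completed work of Definition 2.2 is a weighted count of completed update cycles. As a minimal sketch (the function name `completed_work` and the list representation of the per-tick counts are assumptions made for illustration):

```python
def completed_work(completions_per_tick, c=1):
    """Completed work: c times the number of update cycles completed overall.

    completions_per_tick[i] is the number of processors that complete an
    update cycle at clock tick i + 1; c is the time to complete one cycle.
    """
    return c * sum(completions_per_tick)
```

For instance, if 3, 1, 0 and 2 processors complete a cycle at ticks 1 to 4 and each cycle takes c = 2 time units, the completed work is 2 · (3 + 1 + 0 + 2) = 12.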

Update cycles are units of accounting. They do not constrain the instruction set of the PRAM,
and failures can occur between the instructions of an update cycle. However, in S(I, F, P) the
processors are not charged for the read and write instructions of update cycles that are not
completed.


Definition 2.3 A P-processor PRAM algorithm on any input data I of size |I| = N, and in the
presence of any pattern F of failures and restarts of size $|F| \le M$,

• uses completed work $S = S_{N,M,P} = \max_{I,F}\{S(I,F,P)\}$, and

• has overhead ratio $\sigma = \sigma_{N,M,P} = \max_{I,F}\left\{\frac{S(I,F,P)}{|I| + |F|}\right\}$. □
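The overhead ratio amortizes completed work over the input size and the number of failures and restarts. The sketch below assumes the ratio is S(I, F, P) divided by |I| + |F|, maximized over the executions considered; the function name `overhead_ratio` and the tuple layout are hypothetical.

```python
def overhead_ratio(executions):
    """Maximum of S / (|I| + |F|) over the given executions.

    Each execution is a tuple (completed_work, input_size, pattern_size).
    """
    return max(s / (n + f) for s, n, f in executions)
```

An execution with many failures can have large completed work and still a small ratio, which is the point of amortizing over |F|.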


Consider a definition of total work S'(I, F, P) that also counts incomplete update cycles.
Clearly $S'(I,F,P) \le S(I,F,P) + c|F|$. Thus, using S' does asymptotically affect the measure
of work (when |F| is very large), but it does not asymptotically affect $\sigma$.
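The claim about the overhead ratio follows in one line, assuming the ratio is work divided by |I| + |F|. The symbol $\sigma'$, denoting the ratio computed from S' instead of S, is introduced here for illustration only.

```latex
\[
\frac{S'(I,F,P)}{|I|+|F|}
\;\le\; \frac{S(I,F,P) + c\,|F|}{|I|+|F|}
\;\le\; \frac{S(I,F,P)}{|I|+|F|} + c ,
\qquad\text{hence}\qquad
\sigma \;\le\; \sigma' \;\le\; \sigma + c .
\]
```

The left inequality $\sigma \le \sigma'$ holds because S' counts every update cycle that S counts. Since c is a constant, the two ratios differ by at most an additive constant.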

One might also generalize the overhead ratio as $\frac{S}{\tau(|I|) + |F|}$, where $\tau(|I|)$
is the time complexity of the best sequential solution known to date for the particular problem
at hand. For the purposes of this exposition, it is sufficient to express $\sigma$ in terms of
the ratio $\frac{S}{|I| + |F|}$. This is because for Write-All (by itself and as used in the
simulation) $\tau(|I|) = O(|I|)$.

Now let us briefly comment on the technical choices made in Definitions 2.2 and 2.3.
Work vs. overhead ratio: For arbitrary processor failures and restarts, the completed work
measure S (or the total work S') depends on the size N of the input I, the number of processors
P, and the size of the failure pattern F. The ultimate performance goal for a parallel
fault-tolerant algorithm is to perform the required computation at a work cost as close as
possible to the work performed by the best sequential algorithm known. Unfortunately, this
goal is not attainable when an adversary succeeds in causing too many processor failures
during a computation.

Example A: Consider a Write-All solution, where it takes a processor one instruction to recover
from a failure. If an adversary has a failure pattern F with the number of failures and restarts
$|F| = \Omega(N^{1+\varepsilon})$ for $\varepsilon > 0$, then the completed work will be
$\Omega(N^{1+\varepsilon})$, and thus already non-optimal and potentially large, regardless of
how efficient the algorithm is otherwise. Yet the algorithm may be extremely efficient, since
it takes only one instruction to handle a failure. □

This illustrates the need for a measure of efficiency that is sensitive to both the size of the
input N, and the number of failures and restarts M = |F|. When M = O(P) as in the case of
the stop failures without restarts in [KS 89], S properly describes the algorithm efficiency,
and $\sigma = O(S/N)$. However, when F can be large relative to N and P (as is the case when
restarts are allowed) $\sigma$ better reflects the efficiency of a fault-tolerant algorithm.
Recall that $\sigma$ is insensitive to the choice of S or S', and to using update cycles, as a
measure of work. However, update cycles are necessary for the following two reasons.

Update cycles and termination: Our failure model requires that at any time, at least one
processor is executing an update cycle that completes. (This condition subsumes the condition
of [KS 89] that one processor does not fail during the computation.) This requirement is
formulated in terms of update cycles and assures that some progress is made. Since the
processors lose their context after a failure, they have to read something to regain it.
Without at least one active update cycle completing, the adversary can force the PRAM to thrash
by allowing only these reads to be performed. Similar concerns are discussed in [SS 83].

Update cycles as a unit of accounting: In our definition of completed work we only count
completed update cycles. Even if the progress and termination of a computation are assured (by
always completely executing at least one update cycle), if the processors are charged for
incomplete update cycles, the work S' of any algorithm that simulates a single N processor
PRAM step is at least $\Omega(P \cdot N)$. The reason for this quadratic behavior in S' is the
following simple and rather uninteresting thrashing adversary.

Example B: We evaluate the work of any solution for the Write-All problem under the arbitrary
failure and restart model. Consider the standard PRAM read-compute-write cycle (if processors
begin writing without reading, a simple modification of the argument leads to the same result).
A thrashing adversary allows all processors to perform the read and compute instructions; then
it fails all but one processor for the write operation. Failed processors are then restarted.
Since one write operation is performed per cycle, N cycles will be required to initialize N
array elements. Each of the P processors performs $\Omega(N)$ instructions, which results in
work of $\Omega(P \cdot N)$. □

By charging the processors only for the completed fixed size update cycles we do not charge for
thrashing adversaries. This change in cost measure allows sub-quadratic solutions.
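The gap between the two accounting schemes under the thrashing adversary of Example B can be seen in a small simulation. The sketch rests on assumptions that are not in the paper: the function name `thrash` is hypothetical, read, compute and write each cost one instruction, and the surviving processor always writes a cell that is still 0 (the most favorable case for the algorithm).

```python
def thrash(n: int, p: int) -> dict:
    """Simulate the thrashing adversary on a Write-All instance of size n."""
    array = [0] * n
    charged_instructions = 0   # counts every instruction, as in S'
    completed_cycles = 0       # counts only completed cycles, as in S
    for target in range(n):
        # Every processor performs its read and compute instructions.
        charged_instructions += 2 * p
        # The adversary fails all but one processor just before the write,
        # so exactly one write lands in this cycle.
        array[target] = 1
        charged_instructions += 1
        completed_cycles += 1
        # The failed processors are restarted and the next cycle begins.
    assert all(array)
    return {
        "cycles": n,
        "charged_instructions": charged_instructions,
        "completed_cycles": completed_cycles,
    }
```

Under these assumptions the instruction count is N · (2P + 1), which grows as P · N, while only N update cycles complete.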
2.2  The  strongly  asynchronous  PRAM

The strongly asynchronous PRAM model departs from the standard PRAM models in that the
processors are completely asynchronous. The only synchronizing assumption is that reads and
writes to memory are atomic and serializable, in the sense of Lamport [Lam 86]. Serializability
means that the result of a computation is consistent with some total ordering of atomic actions.
(Note that this does not mean that the actions are in fact ordered this way, but that the
effect of the computation is as if they were.) This is a restriction on the possible outcome of
simultaneous events. With asynchronous processors, the distinction between exclusive writes and
concurrent writes disappears. Among the traditional synchronous PRAM models, the arbitrary CRCW
PRAM is closest to the strongly asynchronous model.

One important situation that is modelled by the strongly asynchronous PRAM is the case
in which the processors are "nearly synchronous." If identical processors access shared memory
across a common communication channel or network, then they will run at approximately the same
speed, but the precise interleaving of memory operations may not be under the direct control of
the processors. To model the lack of control over the interleaving, we posit an on-line
adversary that chooses the interleaving to maximize the cost of the computation. The adversary
is free to delay any processor for any length of time.

Definition 2.4 We define an interleaving to be a sequence of processor numbers, each in the
range [0, P − 1]. An execution of a PRAM algorithm consistent with a particular interleaving is
the execution of steps by the processors in the order specified by the interleaving. □
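An execution consistent with an interleaving can be sketched with one Python generator per processor, where each `yield` marks the end of one atomic step. The names `writer` and `run_interleaving` are hypothetical, and the per-processor program (write 1 into cells pid, pid + P, ...) is a naive example chosen for illustration, not one of the paper's algorithms.

```python
def writer(pid, p, shared):
    """Hypothetical processor program: one atomic write per step."""
    for i in range(pid, len(shared), p):
        shared[i] = 1
        yield  # end of one atomic step


def run_interleaving(interleaving, programs):
    """Execute steps in the order given by the interleaving.

    Entry k of the interleaving names the processor that takes the next
    step. Returns the number of non-halt steps executed.
    """
    executed = 0
    halted = set()
    for pid in interleaving:
        if pid in halted:
            continue
        try:
            next(programs[pid])
            executed += 1
        except StopIteration:
            # The program has no steps left: treat this as its halt.
            halted.add(pid)
    return executed
```

With two processors and six cells, the interleaving 0, 0, 0, 1, 1, 1, 0, 1 lets processor 0 finish before processor 1 starts; an interleaving that only ever names processor 1 leaves the even cells unwritten, which is the delay an adversary is free to impose.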

Definition 2.5 The measure of the efficiency of a strongly asynchronous PRAM is the total
number of steps completed, which we term the total work of the computation (expressed in terms
of P and the input size N). To define total work, we assume that each processor executes a halt
instruction when it terminates work on the algorithm. In order for the algorithm to be correct,
it must be the case that at this point, the postconditions for the algorithm are satisfied. The
total work of an algorithm with respect to a given interleaving is the length of the smallest
halt-free prefix of that interleaving. The total work required by an algorithm is then the
maximum total work over all possible interleavings of the processors. (Note that in this worst
case, all processors will be ready to execute halt instructions.) □

Previous work along these lines has assumed either that randomized algorithms can be used
to defeat off-line adversaries ([MSP 90]) or that interleavings are chosen according to some
probabilistic distribution ([CZ 90, Nis 90]). Some of the models in these last two papers are
similar to our restartable fail-stop model, but failures are probabilistic and restarts do not
destroy private memory. Because of our worst case assumptions, these analyses are
inappropriate. Furthermore, notions of time used in [CZ 90] do not work here, because our
scheduling adversary may introduce arbitrarily long delays.

The notion of wait-free asynchronous computation, in which any one processor terminates in a
finite number of steps regardless of the speeds of the other processors, is introduced in
[Her 88]. In the strongly asynchronous PRAM, by definition any algorithm with bounded work must
be wait-free. The same paper shows that atomic reads and writes are insufficient to solve
two-processor consensus, and demonstrates a hierarchy of stronger primitives for accessing
memory (such as test-and-set or compare-and-swap). A later paper ([AH 90]) demonstrates
wait-free data structures using only atomic reads and writes.

Finally, we note that the strongly asynchronous model is a very general one, and it is subject
to fewer definitional restrictions than is its fail-stop restartable counterpart. However, as a result
of such restrictions, the fail-stop model can be used for general synchronous PRAM simulations (as
we show in Section 5), while the strongly asynchronous model cannot be used for such simulations
due to impossibility results such as [Her 88].

2.3  Comparison  of  the  models 

On  the  surface,  the  two  models  of  restartable  fail-stop  processors  and  of  asynchronous  processors 
are  designed  for  quite  different  situations.  The  fail-stop  model  treats  failure  as  an  abnormal  event, 
which  occurs  with  sufficient  frequency  that  it  cannot  be  ignored.  The  asynchronous  model  treats 
delay  as  a  normal  occurrence.  Nevertheless,  the  two  models  are  closely  related. 

Consider  an  execution  of  an  asynchronous  algorithm.  Because  the  events  are  serializable,  we 
may  assume  without  loss  of  generality  that  the  events  occur  at  discrete  times.  In  other  words,  a  set 
of  time  slices  is  fixed  in  advance,  and  the  scheduling  adversary  chooses  at  each  time  slice  whether  or 
not each processor will start running during that time slice. From this viewpoint, the two models
differ  in  the  following  ways. 

1.  Processors  that  miss  a  time  slice  lose  their  internal  state  in  the  restartable  fail-stop  case,  and 
keep  their  internal  state  in  the  asynchronous  case. 

2.  The  adversary  can  stop  a  processor  after  any  memory  operation  within  a  time  slice  in  the 
restartable  fail-stop  case  while  this  has  no  effect  on  the  asynchronous  case. 

3.  The  time  slices  are  long  enough  for  several  memory  operations  in  the  restartable  fail-stop  case 
but  allow  only  a  single  operation  in  the  asynchronous  case. 

From the algorithmic point of view, the difference between the models concerns the number of
failures during an execution of the algorithm. In the restartable fail-stop model, failure is treated
as a significant event, and the number of failures may be taken into account when measuring the
efficiency of the algorithm. In the asynchronous model, delay is the rule rather than the exception,
and the number of delays is not a particularly meaningful quantity. A normal execution may involve
many delays of each processor between each consecutive step.

An  algorithm  that  performs  a  bounded  amount  of  work  for  any  number  of  failures,  and  has  a 
small  amount  of  state  information,  is  suitable  for  either  model.  An  algorithm  whose  performance 
degrades  significantly  as  the  number  of  failures  increases,  however,  may  only  be  suitable  for  the 
restartable  fail-stop  model.  Algorithms  W  and  V  (as  presented  in  Section  4)  are  examples  of  the 
latter case; algorithms X and T exemplify the former case.

3 Lower bounds for the Write-All problem

3.1  Lower  bounds  with  memory  snapshots 

As we have shown in Example B in Section 2.1, without the update cycle accounting there is
a thrashing adversary that exhibits a quadratic lower bound for the Write-All problem in the
restartable fail-stop model. With the update cycle accounting and for the asynchronous model,
we show $N - P + \Omega(P \log P)$ work lower bounds (when $P \le N$) for both models, even when the
processors can take unit time memory snapshots, i.e., processors can read and locally process the
entire shared memory at unit cost.

Theorem 3.1 Given any P-processor CRCW PRAM algorithm that solves the Write-All problem
of size N ($P \le N$), an adversary (that can cause arbitrary processor failures and restarts) can force
the algorithm to perform $N - P + \Omega(P \log P)$ completed work steps.

Proof:  Let  Z  be  any  algorithm  for  the  Write-All  problem  subject  to  arbitrary  failure/restarts 
using  update  cycles.  Consider  each  PRAM  cycle.  The  adversary  uses  the  following  strategy: 

Let $U \ge 1$ be the number of unvisited array elements. For as long as U > P, the adversary
induces no failures. The work needed to visit N - P array elements when there were no failures is
at least N - P.

As soon as a processor is about to visit the element N - P + 1, making U < P, the adversary
fails and then restarts all P processors. For the upcoming cycle, the adversary determines how
the algorithm assigns processors to write to array elements. By an averaging argument, for any
processor assignment to the U elements, there is a set of $\lfloor U/2 \rfloor$ unvisited elements with no more
than $\lceil P/2 \rceil$ processors assigned to them. The adversary fails these processors, allowing all others to
proceed. Therefore at least $\lfloor P/2 \rfloor$ processors will complete this step having visited no more than half
of the remaining unvisited array locations.

This strategy can be continued for at least log P iterations. The work performed by the
algorithm will be $S \ge N - P + \lfloor P/2 \rfloor \log P = N - P + \Omega(P \log P)$. □
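The adversary's strategy can be tried out against one concrete algorithm. The following is a minimal sketch, not part of the proof: it assumes a snapshot-based algorithm in which processor PID (numbered 1..P) aims at the i-th unvisited element with i = ceil(PID * U / P), and an adversary that starts halving as soon as U <= P. The function names are hypothetical.

```python
def allocate(pid, unvisited, P):
    """Balanced snapshot allocation: PID in 1..P takes the i-th unvisited
    element, where i = ceil(PID * U / P)."""
    U = len(unvisited)
    i = (pid * U + P - 1) // P
    return unvisited[i - 1]


def halving_adversary_work(N, P):
    """Completed work forced on the balanced algorithm by the halving adversary."""
    unvisited = list(range(1, N + 1))
    work = 0
    while unvisited:
        U = len(unvisited)
        target = {pid: allocate(pid, unvisited, P) for pid in range(1, P + 1)}
        failed = set()
        if U <= P:
            load = {e: 0 for e in unvisited}
            for e in target.values():
                load[e] += 1
            # Fail every processor aimed at the floor(U/2) least-loaded elements.
            victims = set(sorted(unvisited, key=lambda e: load[e])[:U // 2])
            failed = {pid for pid, e in target.items() if e in victims}
        survivors = [pid for pid in target if pid not in failed]
        work += len(survivors)              # only completed cycles are charged
        visited = {target[pid] for pid in survivors}
        unvisited = [e for e in unvisited if e not in visited]
    return work
```

For N = 64 and P = 8 the sketch charges 56 work units while U > P and 20 more during the halving phase, which is above the N - P + floor(P/2) log P = 68 guaranteed by the argument.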

Note that the bound holds even if processors are only charged for writes into the array of size N
and  do  not  have  to  only  write  the  value  1.  The  simplicity  of  this  strategy  ensures  that  the  results 
hold  in  the  strongly  asynchronous  model. 

Theorem 3.2 Any P-processor strongly asynchronous PRAM algorithm that solves the Write-All
problem of size N has total work $N - P + \Omega(P \log P)$.

Proof: Any possible execution of an algorithm on the restartable fail-stop model can be duplicated
by an appropriate interleaving on the strongly asynchronous model. The argument in Theorem 3.1
works even if failed processors do not lose local state, and so the same strategy will work in the
strongly asynchronous model. □

This lower bound is the tightest possible bound under the assumption that the processors can
read and locally process the entire shared memory at unit cost. Although such an assumption is
very strong, we present the matching upper bound for two reasons. First, it demonstrates that
any improvement to the lower bound must take account of the fact that processors can read only a
constant number of cells per update cycle. Second, it presents a simple processor allocation strategy
that we use to advantage in the next section.

Theorem 3.3 If processors can read and locally process the entire shared memory at unit cost,
then a solution for the Write-All problem in the restartable fail-stop model can be constructed
such that its completed work using P processors on input of size N is $S = N - P + O(P \log P)$,
when $P \le N$.


Proof: The processors follow the following simple strategy: at each step that a processor PID is
active, it reads the N elements of the array x[1..N] to be visited. Say U of these elements are still
not visited. The processor numbers these U elements from 1 to U based on their position in the
array, and assigns itself to the ith unvisited element such that $i = \lceil PID \cdot U/P \rceil$. This achieves load
balancing with no more than $\lceil P/U \rceil$ processors assigned to each unvisited element. The reading and
local processing is done as a snapshot at unit cost.
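A short sketch of this allocation rule, assuming PIDs are numbered 1..P (the helper names are hypothetical), shows how the load stays within ceil(P/U) processors per unvisited element:

```python
from collections import Counter


def assign(pid, U, P):
    """Index (1..U) of the unvisited element chosen by processor pid in 1..P."""
    return (pid * U + P - 1) // P       # ceil(pid * U / P) in integer arithmetic


def max_load(U, P):
    """Largest number of processors assigned to a single unvisited element."""
    loads = Counter(assign(pid, U, P) for pid in range(1, P + 1))
    return max(loads.values())
```

With P = 8 processors and U = 3 unvisited elements the assignment is 1, 1, 2, 2, 2, 3, 3, 3, so no element receives more than ceil(8/3) = 3 processors.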

We list the elements of the Write-All array in ascending order according to the time at which
the elements are visited (ties are broken arbitrarily). We divide this list into adjacent segments
numbered sequentially starting with 0, such that the segment 0 contains $V_0 = N - P$ elements,
and segment $j \ge 1$ contains $V_j = \lfloor P/(j(j+1)) \rfloor$ elements, for $j = 1, \ldots, m$ and for some $m \le \sqrt{P}$. Let
$U_j$ be the least possible number of unvisited elements when processors were being assigned to the
elements of the jth segment. $U_j$ can be computed as $U_j = N - \sum_{i=0}^{j-1} V_i$. $U_0$ is of course N, and
for $j \ge 1$, $U_j = P - \sum_{i=1}^{j-1} V_i \ge P - (P - P/j) = P/j$. Therefore no more than $\lceil P/U_j \rceil$ processors were
assigned to each element.
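The bound on the number of unvisited elements rests on a telescoping sum. As a worked step, dropping the floors (which can only make each segment smaller and each $U_j$ larger):

\[
\sum_{i=1}^{j-1} \frac{P}{i(i+1)}
= P \sum_{i=1}^{j-1} \left( \frac{1}{i} - \frac{1}{i+1} \right)
= P \left( 1 - \frac{1}{j} \right)
= P - \frac{P}{j},
\qquad\text{hence}\qquad
U_j \ge P - \left( P - \frac{P}{j} \right) = \frac{P}{j}.
\]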


The  work  performed  by  such  an  algorithm  is: 


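\[
S \;\le\; \sum_{j=0}^{m} V_j \left\lceil \frac{P}{U_j} \right\rceil
\;\le\; V_0 + \sum_{j=1}^{m} \left\lfloor \frac{P}{j(j+1)} \right\rfloor \left\lceil \frac{P}{P/j} \right\rceil
\;=\; V_0 + O\!\left( P \sum_{j=1}^{m} \frac{1}{j+1} \right)
\;=\; N - P + O(P \log P). \qquad \Box
\]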


A similar situation holds in the strongly asynchronous model.


Theorem 3.4 If processors can read and locally process the entire shared memory at unit cost,
then a solution for the Write-All problem in the strongly asynchronous model can be constructed
with total work $N - P + O(P \log P)$ using P processors on input of size N, for $P \le N$.

Proof: We use the same algorithm as in the previous proof. The proof itself applies to the strongly
asynchronous model with the following modifications: (1) one unit of total work is charged for each
read and the write that (potentially) follows; (2) as soon as a processor performs a read, it is
charged one unit work; this is done to take care of the situation when a processor performs a write
only after all elements in a given segment have been initialized. □

3.2 Lower bounds with test-and-set operations

Under  certain  assumptions  on  the  way  that  memory  is  accessed  in  the  strongly  asynchronous  model, 
we  can  prove  a  different  lower  bound.  Assume  for  the  moment  that,  instead  of  atomic  reads  and 
writes, memory is accessed by means of test-and-set operations. That is, memory can only contain
zeroes  and  ones,  and  a  single  test-and-set  operation  on  a  memory  cell  sets  the  value  of  that  cell 
to  1  and  returns  the  old  value  of  the  cell.  (We  will  discuss  shortly  how  this  assumption  can  be 
generalized.) 

Theorem 3.5 Any strongly asynchronous PRAM algorithm for the Write-All problem which uses
test-and-set as an atomic operation requires $N + \Omega(P \log(N/P))$ total work, for $P \ge 3$.

Proof: Consider the following class of interleavings. A round will be a length of time in which
processors take one step each in PID order; formally, it is the sequence of PIDs (1, 2, ..., P). We
will run the algorithm in phases. To define a phase, suppose that U cells out of the original N
remain unset at the beginning of a phase. We imagine running the algorithm in rounds until a
collision occurs; that is, until a test-and-set operation is done on a cell that is already set to one.
Suppose this happens in the tth round. The actual definition of the phase depends on the nature
of the collision; there are two cases.

If  the  cell  involved  in  the  collision  was  set  in  this  round,  then  it  was  initially  set  by  some 
processor  with  PID  i,  and  set  again  by  some  processor  with  PID  j.  Then  to  define  the  phase,  we  let 
only  processors  i  and  j  alternate  steps,  instead  of  running  all  processors;  that  is,  the  phase  consists 
of  the  PIDs  i,j  repeated  t  times.  A  total  of  2t  steps  are  taken  and  one  of  them  is  wasted  work. 

On  the  other  hand,  if  the  cell  was  set  in  a  previous  round,  then  consider  the  processor  with  PID 
j  that  set  it  in  this  round  and  let  only  this  processor  take  steps.  That  is,  the  phase  consists  of  the 
PID  j  repeated  t  times,  for  a  total  of  t  steps  and  one  wasted  step. 

We now note that t must be at most $\lceil U/P \rceil$, and so a recurrence for the amount of wasted work
$W(U)$ is $W(U) \ge 1 + W(U - 2\lceil U/P \rceil + 1)$. By induction, we can show that $W(U) \ge cP \ln(U/(2P))$
for a suitable constant $c > 0$; the result follows by noting that unwasted work N is necessary.

The trivial base case of the induction is $U < 2P$. Now suppose that the inequality
$W(x) \ge cP \ln(x/(2P))$ holds for all integer $x < U$. By the induction hypothesis, we have
$W(U) \ge 1 + cP \ln((U - 2\lceil U/P \rceil + 1)/(2P)) \ge 1 + cP \ln(U/(2P)) + cP \ln(1 - 2/P - 1/U)$.
It thus suffices to prove $1 + cP \ln(1 - 2/P - 1/U) \ge 0$. But

\[ 1 + cP \ln(1 - 2/P - 1/U) \;\ge\; 1 + cP \ln(1 - 5/(2P)) \;\ge\; 1 + cP \left( -5/(2P - 5) \right) \;\ge\; 0. \]

The first inequality is valid because $U \ge 2P$; the second inequality uses $\ln(1 - x) \ge -x/(1 - x)$,
which can be seen by comparing power series; the third inequality is valid for $P \ge 3$ and any choice
of $c \le 1/15$. No attempt was made to optimize the constant c. □
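The induction can be sanity-checked numerically. The sketch below is an illustration only: it iterates the recurrence with equality, stopping once U drops below 2P, and compares the number of phases with c P ln(U/(2P)) for c = 1/15. The function names are hypothetical.

```python
import math


def wasted_phases(U, P):
    """Iterate W(U) = 1 + W(U - 2*ceil(U/P) + 1), with W(U) = 0 once U < 2P."""
    count = 0
    while U >= 2 * P:
        U = U - 2 * ((U + P - 1) // P) + 1
        count += 1
    return count


def log_bound(U, P, c=1 / 15):
    """The claimed lower bound c * P * ln(U / (2P)) on the wasted work."""
    return c * P * math.log(U / (2 * P))
```

Each phase contributes one unit of wasted work, so `wasted_phases(U, P)` should never fall below `log_bound(U, P)` for P >= 3.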

The argument used in this lower bound can be applied equally well if the atomic operation is
compare-and-swap, or to any set of atomic read-modify-write operations where the read and writes
are constrained to be to the same cells. It also applies to atomic read and atomic write, but in this
case there is no known matching upper bound, whereas algorithm T (presented in the next section)
can match the lower bound (for some choices of atomic operation) in the case P = 3. The above
proof technique also applies to the fail-stop restartable model, when each update cycle accesses
only one array element used by the Write-All problem.

4 Algorithms for the Write-All problem

The original motivation for studying the Write-All problem was that it intuitively captured the
essential nature of a single synchronous PRAM step. This intuition was made concrete when it was
shown ([KPS 90, Shv 89]) how to use any algorithm for the Write-All problem in general PRAM
simulations. This application is discussed in the next section; in this section, we will present new
algorithms for the Write-All problem.

In  what  follows,  we  assume  that  the  number  of  array  elements  N  and  the  number  of  processors 
P are powers of 2. Nonpowers of 2 can be handled using conventional padding techniques. All
logarithms are base 2.


4.1 Algorithm V: a modification of a no-restart algorithm

Algorithm W of [KS 89] is an efficient fail-stop (no restart) Write-All solution. The algorithm uses
two full binary trees as its basic data structures (the processor counting and the progress
measurement trees). The algorithm uses an iterative approach in which all active processors synchronously
execute the following four phases:

W1: Processors are counted and enumerated using a static bottom-up, logarithmic time traversal
of the processor counting tree data structure.

W2: Processors are allocated to the unvisited array locations according to a divide-and-conquer
strategy using a dynamic top-down traversal of the progress tree data structure.

W3: Array assignments are done.

W4: Progress is evaluated by a dynamic bottom-up traversal of the progress tree data structure.

This algorithm has efficient completed work when subjected to arbitrary failure patterns without
restarts. It can be extended to handle processor restarts by introducing an iteration counter, and
having  the  revived  processors  wait  for  the  start  of  a  new  iteration.  However,  this  algorithm  may 
not  terminate  if  the  adversary  does  not  allow  any  of  the  processors  that  were  alive  at  the  beginning 
of  an  iteration  to  complete  that  iteration.  Even  if  the  extended  algorithm  were  to  terminate,  its 
completed  work  is  not  bounded  by  a  function  of  N  and  P. 

In addition, the proof framework of [KS 89] does not easily extend to include processor restarts:
the processor enumeration and allocation phases become inefficient and possibly incorrect, since no
accurate estimates of active processors can be obtained when the adversary can revive any of the
failed processors at any time.

On the other hand, the second phase of algorithm W can implement processor assignment (in a
manner similar to that used in the proof of Theorem 3.3) in O(log N) time by using the permanent
processor PID in the top-down divide-and-conquer allocation. This also suggests that the processor
enumeration phase of algorithm W does not improve its efficiency when processors can be restarted.

Therefore we present a modified version of algorithm W, that we call V. To avoid a complete
restatement of the details of algorithm V, the reader is urged to refer to [KS 89].

V uses the data structures of the optimized algorithm W of [KS 89] (i.e., full binary trees
with $N/\log N$ leaves) for progress estimation and processor allocation. There are log N array elements
associated with each leaf. When using P processors such that $P > N/\log N$ on such data structures, it
is sufficient for each processor to take its PID modulo $N/\log N$ to assure that there is a uniform initial
assignment of at least $\lfloor P/(N/\log N) \rfloor$ and no more than $\lceil P/(N/\log N) \rceil$ processors to a work element.
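The modular initial assignment is easy to illustrate. The sketch below assumes the progress tree has N / log N leaves (which follows from having log N array elements per leaf) and that N is a power of 2 whose logarithm divides it; `initial_leaf` is a hypothetical helper name.

```python
from collections import Counter


def initial_leaf(pid, N):
    """Leaf (0-based) for processor pid: PID modulo the number of leaves."""
    log_n = N.bit_length() - 1          # log2(N) for N a power of 2
    leaves = N // log_n                 # N / log N leaves, log N elements each
    return pid % leaves


def leaf_loads(N, P):
    """Number of processors initially assigned to each leaf."""
    return Counter(initial_leaf(pid, N) for pid in range(P))
```

For N = 256 there are 32 leaves; with P = 80 processors, 16 leaves receive 3 processors and the other 16 receive 2, matching the floor and ceiling of 80/32.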

Algorithm V is an iterative algorithm using the following three phases.

V1: Allocate processors using PIDs in a dynamic top-down traversal of the progress tree to assure
load balancing (O(log N) time).

V2: The processors now perform work at the leaves they reached in phase V1 (there are log N
array elements per leaf).

V3: The processors begin at the leaves of the progress tree where they ended phase V2 and update
the progress tree dynamically, bottom up (O(log N) time).
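Phase V1 can be pictured as the allocation rule of Theorem 3.3 carried out by walking down a tree instead of taking a snapshot. The following sketch is an illustration under stated assumptions, not the data structure of [KS 89]: it assumes each tree node stores the number of unvisited leaves below it, that PIDs are numbered 0..P-1, and that processor PID aims at unvisited leaf number floor(PID * U / P). All names are hypothetical.

```python
def build_counts(unvisited_flags):
    """Heap-ordered tree; node k stores the number of unvisited leaves below it."""
    L = len(unvisited_flags)            # number of leaves, a power of 2
    tree = [0] * (2 * L)
    tree[L:] = [1 if flag else 0 for flag in unvisited_flags]
    for k in range(L - 1, 0, -1):
        tree[k] = tree[2 * k] + tree[2 * k + 1]
    return tree


def allocate_top_down(pid, P, tree):
    """Leaf (0-based) reached by processor pid in 0..P-1, or None if no work is left."""
    L = len(tree) // 2
    U = tree[1]
    if U == 0:
        return None
    rank = pid * U // P                 # which unvisited leaf to aim for
    k = 1
    while k < L:                        # one tree level per step: O(log L) steps
        if rank < tree[2 * k]:
            k = 2 * k
        else:
            rank -= tree[2 * k]
            k = 2 * k + 1
    return k - L
```

With 8 leaves of which leaves 0, 3, 4 and 6 are unvisited, 8 processors are spread two per unvisited leaf, and no processor is sent to a leaf that is already done.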

Processor re-synchronization after a failure and a restart is an important implementation detail.
One way of realizing processor re-synchronization is through the utilization of an iteration
wrap-around counter that is based on the synchronous PRAM clock. If a processor fails, and then is
restarted, it waits for the counter wrap-around to rejoin the computation. The point at which the
counter wraps around depends on the length of the program code, but it is fixed at "compile time".
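As a small illustration of the wrap-around idea, assume the counter is the global clock taken modulo a fixed iteration length (a hypothetical parameter `iteration_length`, known at "compile time"). A restarted processor then idles until the counter next reads 0:

```python
def rejoin_time(restart_time, iteration_length):
    """First clock tick >= restart_time at which the wrap-around counter reads 0."""
    remainder = restart_time % iteration_length
    if remainder == 0:
        return restart_time
    return restart_time + iteration_length - remainder
```

For example, with an iteration length of 12 clock ticks, a processor restarted at tick 17 would rejoin at tick 24, the start of the next iteration.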

Analysis of algorithm V:

We  now  analyze  the  performance  of  this  algorithm  first  in  the  fail-stop,  and  then  in  the  fail-stop 
and  restart  setting. 

Lemma 4.1 The completed work of algorithm V using $P \le N$ processors that are subject to
fail-stop errors without restarts is $S = O(N + P \log^2 N)$.

Proof: We factor out any work that is wasted due to failures by charging this work to the failures.
Since the failures are fail-stop, there can be at most P failures, and each processor that fails can
waste at most O(log N) steps corresponding to a single iteration of the algorithm. Therefore the
work charged to the failures is O(P log N), and it will be absorbed by the rest of the work.

We next evaluate the work that directly contributes to the progress of the algorithm by
distinguishing two cases below. In each of the cases, it takes $O(\log(N/\log N)) = O(\log N)$ time to perform
processor allocation, and O(log N) time to perform the work at the leaves. Thus each iteration of
the algorithm takes O(log N) time. We use the allocation technique of Theorem 3.3, where instead
of reading and locally processing the entire memory at unit cost, we use an O(log N) time iteration
for processor allocation.

Case 1: $1 \le P < N/\log N$. In this case, at most 1 processor is initially allocated to each leaf. As in
the proof of Theorem 3.3, when the first $N/\log N - P$ leaves are visited, there is no more than one
processor allocated to each leaf by the balanced allocation phase. When the remaining P or less
leaves are visited, the work is O(P log P) by Theorem 3.3 (not counting processor allocation). Each
leaf visit takes O(log N) work steps; therefore the completed work is:

\[ S = O\!\left( \left( \frac{N}{\log N} - P + P \log P \right) \cdot \log N \right) = O(N + P \log P \log N) = O(N + P \log^2 N). \]

Case 2: $N/\log N \le P \le N$. In this case, no more than $\lceil P/(N/\log N) \rceil$ processors are initially allocated to
each leaf. Any two processors that are initially allocated to the same leaf, should they both survive,
will behave identically throughout the computation. Therefore we can use Theorem 3.3 with the
$\lceil P/(N/\log N) \rceil$ processor allocation as a multiplicative factor. From this, the completed work is:

\[ S = O\!\left( \left\lceil \frac{P}{N/\log N} \right\rceil \cdot \frac{N}{\log N} \log \frac{N}{\log N} \cdot \log N \right) = O(P \log^2 N). \]

The results of the two cases combine to yield $S = O(N + P \log^2 N)$. □
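The simplification in Case 1 can be spelled out. Multiplying through by $\log N$ and using $\log P \le \log N$ (which holds since $P \le N$):

\[
\left( \frac{N}{\log N} - P + P \log P \right) \log N
= N - P \log N + P \log P \log N
\le N + P \log P \log N
\le N + P \log^2 N .
\]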

The following theorem expresses the completed work of the algorithm in the presence of restarts:

Theorem 4.2 The completed work of algorithm V using $P \le N$ processors subject to an arbitrary
failure and restart pattern F of size M is: $S = O(N + P \log^2 N + M \log N)$.

Proof: The proof of Lemma 4.1 does not rely on the fact that in the absence of restarts, the
number of active processors is non-increasing. However, the lemma does not account for the work
that might be performed by processors that are active during a part of an iteration but do not
contribute to the progress of the algorithm due to failures. To account for all work, we are going to
charge to the array being processed the work that contributes to progress, and any work that was
wasted due to failures will be charged to the failures and restarts. Lemma 4.1 accounts for the work
charged to the array. Otherwise, we observe that a processor can waste no more than O(log N) time
steps without contributing to the progress due to a failure and/or a restart. Therefore this amount
of wasted work is bounded by O(M log N). This proves the theorem. (Note that the completed
work S of V is small for small |F|, but not bounded by a function of P and N for large |F|.) □


4.2  Algorithm  X:  a  binary  tree  algorithm 

We present a new algorithm X for the Write-All problem, and show that its completed/total work
complexity is $S = O(N \cdot P^{\log\frac{3}{2}})$ using $P \le N$ processors in the restartable fail-stop and the strongly
asynchronous models of computation. The important property of X is that it has bounded
sub-quadratic completed work; in the restartable fail-stop model, this is independent of the failure
pattern. If a very large number of failures occurs, say $|F| = \Omega(N \cdot P^{\log\frac{3}{2}})$, then the algorithm's
overhead ratio σ becomes optimal: it takes a fixed number of computing steps per failure/recovery.

Like algorithm V, algorithm X utilizes a progress tree of size N, but it is traversed by the
processors independently, not in synchronized phases. This reflects the local nature of the processor
assignment in algorithm X as opposed to the global assignments used in algorithms V and W. Each
processor, acting independently, searches for work in the smallest immediate subtree that has work
that needs to be done. It then performs the necessary work, and moves out of that subtree when
no more work remains. We present the algorithm on the restartable fail-stop model.

Input: Shared array x[1..N]; x[i] = 0 for $1 \le i \le N$.

Output: Shared array x[1..N]; x[i] = 1 for $1 \le i \le N$.

Data-structures: The algorithm uses a full binary tree of size 2N - 1, stored as a heap d[1..2N - 1]
in shared memory. An internal tree node d[i] (i = 1, ..., N - 1) has the left child d[2i] and the
right child d[2i + 1]. The tree is used for progress evaluation and processor allocation. The values
stored in the heap are initially 0.

01  forall processors PID=0..P - 1 parbegin
02      Perform initial processor assignment to the leaves of the progress tree
03      while there is still work left in the tree do
04          if current subtree is done then move one level up
05          elseif this is a leaf then perform the work at the leaf
06          elseif this is an interior node then
07              if both subtrees are done then update the tree node
08              elseif only one is done then go to the one that is not done
09              else move to the left/right subtree according to PID bit values
10              fi
11          fi
12      od
13  parend

Figure 2: A high level view of the algorithm X.

The N elements of the input array x[1..N] are associated with the leaves of the tree. Element
x[i] is associated with d[i + N - 1], where $1 \le i \le N$. The algorithm also utilizes an array w[0..P - 1]
that is used to store individual processor locations within the progress tree d.
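The heap layout can be summarized by a few index helpers. This is a minimal sketch with hypothetical function names, assuming the usual 1-based heap numbering in which node i has children 2i and 2i + 1 and the array element x[i] sits at leaf d[i + N - 1]:

```python
def left_child(i):
    """Heap index of the left child of interior node i."""
    return 2 * i


def right_child(i):
    """Heap index of the right child of interior node i."""
    return 2 * i + 1


def parent(i):
    """Heap index of the parent of node i; 0 means 'above the root'."""
    return i // 2


def leaf_of(i, N):
    """Heap index of the leaf associated with array element x[i], 1 <= i <= N."""
    return i + N - 1


def is_leaf(k, N):
    """True if heap node k (1 <= k <= 2N - 1) is a leaf."""
    return k >= N


def depth(k):
    """Depth of heap node k, with the root at depth 0."""
    return k.bit_length() - 1
```

For N = 8 the leaves occupy d[8..15], so x[1] maps to d[8] and x[8] maps to d[15].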

Each  processor  uses  some  constant  amount  of  private  memory  to  perform  simple  arithmetic 
computations.  An  important  private  constant  is  PID,  containing  the  initial  processor  identifier. 

Thus, the overall memory used is O(N + P) and the data-structures are simple.

Control-flow: The algorithm consists of a single initialization and of the parallel loop. A high
level view of the algorithm is in Figure 2; all line numbers refer to this figure. More detailed code
can be found in Appendix A.

The initialization (line 02) assigns the P processors to the leaves of the progress tree so that the
processors are assigned to the first P leaves by storing the initial leaf assignment in w[PID]. The
loop (lines 03-12) consists of a multi-way decision (lines 04-11). If the current node is marked done,
the processor moves up the tree (line 04). If the processor is at a leaf, it performs work (line 05). If
the current node is an unmarked interior node and both of its subtrees are done, the interior node
is marked by changing its value from 0 to 1 (line 07). If a single subtree is not done, the processor
moves down appropriately (line 08).

For the final case (line 09), the processors move down when neither child is done. This is the last
case, where a non-trivial (italicized) decision is made. The PID of the processor is used at depth h of
the tree node based on the value of the h-th most significant bit of the binary representation of the
PID: bit 0 will send the processor to the left, and bit 1 to the right.

Regardless of the decision made by a processor within the loop body, each iteration of the body
consists of no more than four shared memory reads, a fixed time computation using private memory,
and one shared memory write (see Appendix A for the detailed algorithm). Therefore the body
can be implemented as an update cycle.
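The control flow of Figure 2 can be exercised with a small simulation. The sketch below is an illustration under several assumptions, and Appendix A remains the authoritative code: processors run in synchronous update cycles (all reads of a cycle precede its writes); the adversary is a seeded random one rather than a worst-case one; a leaf visit is split into two cycles (write x, then mark the leaf) to respect the one-write-per-cycle rule; and the PID is treated as a log N-bit number whose most significant bit is consulted at the root. The name `run_algorithm_x` and its parameters are hypothetical.

```python
import random


def run_algorithm_x(N, P, seed=0, fail_prob=0.5):
    """Simulate algorithm X under random failures/restarts.

    Returns (x, completed_work), where x[1..N] is the Write-All array."""
    assert N & (N - 1) == 0 and 1 <= P <= N     # N is a power of 2
    log_n = N.bit_length() - 1
    x = [0] * (N + 1)                   # x[1..N]
    d = [0] * (2 * N)                   # progress tree d[1..2N-1] as a heap
    w = [N + pid for pid in range(P)]   # shared positions: the first P leaves
    rng = random.Random(seed)
    work = 0
    while any(w):                       # w[pid] == 0 means pid has left the root
        alive = [pid for pid in range(P) if w[pid] != 0]
        # Adversary: a processor fails this cycle with probability fail_prob,
        # but at least one update cycle must complete.
        survivors = [pid for pid in alive if rng.random() >= fail_prob]
        if not survivors:
            survivors = [rng.choice(alive)]
        writes = []                     # applied only after every read is done
        for pid in survivors:
            where = w[pid]
            if d[where]:                            # line 04: move one level up
                writes.append((w, pid, where // 2))
            elif where >= N:                        # line 05: work at the leaf
                leaf = where - N + 1
                if x[leaf] == 0:
                    writes.append((x, leaf, 1))
                else:
                    writes.append((d, where, 1))
            else:
                left, right = d[2 * where], d[2 * where + 1]
                if left and right:                  # line 07: mark the node
                    writes.append((d, where, 1))
                elif left or right:                 # line 08: go to the undone child
                    writes.append((w, pid, 2 * where + (1 if left else 0)))
                else:                               # line 09: choose by PID bit
                    depth = where.bit_length() - 1
                    bit = (pid >> (log_n - 1 - depth)) & 1
                    writes.append((w, pid, 2 * where + bit))
        for array, index, value in writes:
            array[index] = value
        work += len(survivors)          # failed cycles are not charged
    return x, work
```

A failed processor loses nothing that matters here, because its position lives in the shared array w; on restart it simply resumes from w[PID]. In every run the array ends up fully written, and the completed work stays within four update cycles per processor per tree node.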


Figure 3: Processor traversal of the progress tree.

Example C: Consider algorithm X for N = P = 8. The progress tree d of size 2N - 1 = 15
is used to represent the full binary progress tree with eight leaves. The 8 processors have PIDs
in the range 0 through 7. Their initial positions are indicated in Figure 3 under the leaves of
the tree. The diagram illustrates the state of a computation where the processors were subject
to some failures and restarts. Heavy dots indicate nodes whose subtrees are finished. The paths
being traversed by the processors are indicated by the arrows. Active processor locations (at the
time when the snapshot was taken) are indicated by their PIDs in brackets. In this configuration,
should the active processors complete the next cycle, they will move in the directions indicated by
the arrows: processors 0 and 1 will descend to the left and right respectively, processor 4 will move
to the unvisited leaf to its right, and processors 6 and 7 will move up. □


Analysis of algorithm X:

We begin by showing the correctness and termination of algorithm X in the following simple lemma.

Lemma 4.3 Algorithm X with P processors is a correct, terminating and fault-tolerant solution
for the Write-All problem of size N. The algorithm terminates in at least $\Omega(\log N)$
and at most $O(P \cdot N)$ time steps.

Proof: We first observe that the processor loads are localized in the sense that a processor exhausts
all work in the vicinity of its original position in the tree, before moving to other areas of the tree.
If a processor moves up out of a subtree then all the leaves in that subtree were visited. We also
observe that it takes exactly one update cycle to: (i) change the value of a progress tree node from
0 to 1, (ii) to move up from a (non-root) node, or (iii) to move down left, or (iv) down right from
a (non-leaf) node. Therefore, given any node of the progress tree and any processor, the processor
will visit and spend exactly one complete update cycle at the node no more than four times.

Since there are 2N - 1 nodes in the progress tree, any processor will be able to execute no more
than O(N) completed update cycles. If there are P processors, then all processors will be able to
complete no more than $O(P \cdot N)$ update cycles. Furthermore, at any point in time, there is at
least one update cycle that will complete. Therefore it will take no more than $O(P \cdot N)$ sequential
update cycles of constant size for the algorithm to terminate.

Finally,  we  also  observe  that  all  p.aths  from  a  leaf  to  the  root  are  at  least  log  A'^  long,  therefore 
r\f  least  logiV  update  cycles  per  processor  will  he  required  for  the  algorithm  to  terminate.  □ 



Now we prove the main work lemma. In the rest of this section, the expression $S_{N,P}$ denotes
the completed work on inputs of size $N$ using $P$ initial processors and for any failure pattern. Note
that in this lemma we assume $P \ge N$.


Lemma 4.4 The completed work of algorithm X for the Write-All problem of size $N$ with $P \ge N$
initial processors and for any pattern of failures and restarts is $S_{N,P} = O(P \cdot N^{\log\frac{3}{2}})$.


Proof: We show by induction on the height of the progress tree that there are positive constants
$c_1, c_2, c_3$ such that $S_{N,P} \le c_1 P \cdot N^{\log\frac{3}{2}} - c_2 P \log N - c_3 P$.

For the base case: we have a tree of height 0 that corresponds to an input array of size 1 and
at least as many initial processors $P$. Since at least one processor, and at most $P$ processors will
be active, this single leaf will be visited in a constant number of steps. Let the work expended be
$c'P$ for some constant $c'$ that depends only on the lexical structure of the algorithm. Therefore
$S_{1,P} = c'P \le c_1 P \cdot 1^{\log\frac{3}{2}} - c_2 P \cdot 0 - c_3 P$ when $c_1$ is chosen to be larger than or equal to $c_3 + c'$.

Now consider a tree of height $\log N$ ($\ge 1$). The root has two subtrees (left and right) of height
$\log N - 1$. By the definition of algorithm X, no processor will leave a subtree until the subtree
is marked-one, i.e., the value of the root of the subtree is changed from 0 to 1. We consider the
following sub-cases: (1) both subtrees are marked-one simultaneously, and (2) one of the subtrees
is marked-one before the other.


Case 1: If both subtrees are marked-one simultaneously, then the algorithm will terminate after
the two independent subtrees terminate plus some small constant number of steps $c'$ (when a
processor moves to the root and determines that both of the subtrees are finished). Both the work
$S_L$ expended in the left subtree and the work $S_R$ in the right subtree are bounded by $S_{N/2,P/2}$.
The added work needed for the algorithm to terminate is at most $c'P$, and so the total work is:

$$S \le S_L + S_R + c'P \le 2 S_{N/2,P/2} + c'P \le 2\left(c_1 \frac{P}{2}\left(\frac{N}{2}\right)^{\log\frac{3}{2}} - c_2 \frac{P}{2}\log\frac{N}{2} - c_3\frac{P}{2}\right) + c'P$$

$$= \frac{2}{3} c_1 P N^{\log\frac{3}{2}} - c_2 P \log\frac{N}{2} - c_3 P + c'P \le c_1 P \cdot N^{\log\frac{3}{2}} - c_2 P \log N - c_3 P$$

for sufficiently large $c_1$ and any $c_2$ depending on $c'$, e.g., $c_1 \ge 3(c_2 + c')$.

Case 2: Assume without loss of generality that the left subtree is marked-one first with $S_L =
S_{N/2,P/2}$ work being expended in this subtree. Any active processors from the left subtree will start
moving via the root to the right subtree. The path traversed by any processor as it moves to the
right subtree after the left subtree is finished is bounded by the maximum path length from a leaf
to another leaf, $c' \log N$ for a predefined constant $c'$. No more than the original $P/2$ processors of
the left subtree will move, and so the work of moving the processors is bounded by $c'(P/2)\log N$.

We observe that the cost of an execution in which $P$ processors begin at the leaves of a tree
(with $N/2$ leaves) differs from the cost of an execution where $P/2$ processors start at the leaves,
and $P/2$ arrive at a later time via the root, by no more than the cost $c'(P/2)\log N$ accounted
for above. This can be simply shown by constructing a scenario in which the second set of $P/2$
processors do not arrive through the root, but instead start their execution with a failure, and then
traverse along a path of 1's (if any) in the progress tree, until they reach a 0 node that is either a
leaf, or whose descendants are marked. Having accounted for the difference, we see that the work
$S_R$ to complete the right subtree using up to $P$ processors is bounded by $S_{N/2,P}$ (by the definition
of $S$, if $P_1 \le P_2$, then $S_{N,P_1} \le S_{N,P_2}$). After this, each processor will spend some constant number
of steps moving to the root and terminating the algorithm. This work is bounded by $c''P$ for some
small constant $c''$. The total work $S$ is:

$$S \le S_L + c'\frac{P}{2}\log N + S_R + c''P \le S_{N/2,P/2} + c'\frac{P}{2}\log N + S_{N/2,P} + c''P$$

$$\le c_1\frac{P}{2}\left(\frac{N}{2}\right)^{\log\frac{3}{2}} - c_2\frac{P}{2}\log\frac{N}{2} - c_3\frac{P}{2} + c'\frac{P}{2}\log N + c_1 P\left(\frac{N}{2}\right)^{\log\frac{3}{2}} - c_2 P\log\frac{N}{2} - c_3 P + c''P$$

$$= c_1 P N^{\log\frac{3}{2}} - c_2 P\log N\left(\frac{3}{2} - \frac{c'}{2c_2}\right) - c_3 P\left(\frac{3}{2} - \frac{3c_2}{2c_3} - \frac{c''}{c_3}\right) \le c_1 P \cdot N^{\log\frac{3}{2}} - c_2 P\log N - c_3 P$$

for sufficiently large $c_2$ and $c_3$ depending on fixed $c'$ and $c''$, e.g., $c_2 \ge c'$ and $c_3 \ge 3c_2 + 2c''$.

Since the constants $c'$, $c''$ depend only on the lexical structure of the algorithm, the constants
$c_1, c_2, c_3$ can always be chosen sufficiently large to satisfy the base case and both the cases (1) and
(2) of the inductive step. This completes the proof of the lemma. □

The quantity $P \cdot N^{\log\frac{3}{2}}$ is about $P \cdot N^{0.59}$. We next show a particular pattern of failures for
which the completed work of algorithm X matches this upper bound.

Lemma 4.5 There exists a pattern of fail-stop/restart errors that causes algorithm X to perform
$S = \Omega(N^{\log 3})$ work on the input of size $N$ using $P = N$ processors.

Proof: We can compute the exact work performed by the algorithm when the adversary adheres
to the following strategy:

(a) The processor with PID 0 will be allowed to sequentially traverse the progress tree in post-order
starting at the leftmost leaf and finishing at the rightmost leaf.

(b) The processors that find themselves at the same leaf as processor 0 are (re)started and are
allowed to traverse the progress tree until they reach a leaf, where they are failed.

(c) Procedure (b) is repeated until all leaves are visited.

Thus the leaves of the progress tree are visited left to right, from the leaf number 1 to the leaf
number $N$. At any time, if $i$ is the number of the rightmost visited leaf, then only the processors
with PIDs 0 to $i - 1$ have performed at least one update cycle thus far.

The cost of such a strategy can be expressed inductively as follows:

The cost $C_0$ of traversing a tree of size 1 using a single processor is 1 (unit of completed work).

The cost $C_{i+1}$ of traversing a tree of size $2^{i+1}$ is computed as follows: first, there is the cost $C_i$ of
traversing the left subtree of size $2^i$. Then, all processors move to the right subtree and participate
(subject to failures) in the traversal of the right subtree at the cost of $2C_i$; the cost is doubled
because the two processors whose PIDs are equal modulo $2^i$ behave identically. Thus $C_{i+1} = 3C_i$,
and $C_{\log N} = 3^{\log N} = N^{\log 3}$. □
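The recurrence at the end of this proof can be checked numerically. The following Python sketch is an illustration only, not part of the original analysis; the function names are hypothetical, and it assumes that $N$ is a power of two and that logarithms are base 2.

```python
import math


def adversary_cost(levels):
    """Unroll C_0 = 1, C_{i+1} = C_i + 2*C_i for a tree with 2**levels leaves."""
    cost = 1  # C_0: one processor, one leaf, one unit of completed work
    for _ in range(levels):
        left = cost        # traverse the left subtree once
        right = 2 * cost   # the right subtree costs double (paired PIDs)
        cost = left + right
    return cost


def closed_form(n):
    """N**log2(3), the value the recurrence is claimed to reach."""
    return n ** math.log2(3)


if __name__ == "__main__":
    for levels in range(0, 11):
        n = 2 ** levels
        print(n, adversary_cost(levels), round(closed_form(n)))
```

For every level the unrolled recurrence and the closed form agree up to floating-point rounding, which is the identity $3^{\log N} = N^{\log 3}$ used above.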

Now we show how to use algorithm X with $P$ processors to solve Write-All problems of size $N$
such that $P \le N$. Given an array of size $N$, we break the $N$ elements of the input into $N/P$ groups
of $P$ elements each (the last group may have fewer than $P$ elements). The $P$ processors are then
used to solve $N/P$ Write-All problems of size $P$ one at a time. We call this algorithm X', and we will
use X' in the general simulations.
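The grouping step of X' can be sketched as follows. This is a minimal illustration under stated assumptions: indices are 0-based, and `sequential_fill` is a hypothetical stand-in for running algorithm X on one group (it simply writes the ones and reports one unit of work per cell), so the sketch shows only how the groups are formed and processed one at a time.

```python
def split_into_groups(n, p):
    """Half-open index ranges of the groups of at most p cells each."""
    return [(start, min(start + p, n)) for start in range(0, n, p)]


def sequential_fill(x, start, end, p):
    """Hypothetical stand-in for algorithm X on x[start:end]; returns work spent."""
    for k in range(start, end):
        x[k] = 1
    return end - start


def run_x_prime(n, p, solve_group):
    """Solve Write-All of size n as a sequence of group-sized problems."""
    x = [0] * n
    total_work = 0
    for start, end in split_into_groups(n, p):
        total_work += solve_group(x, start, end, p)
    return x, total_work


if __name__ == "__main__":
    print(split_into_groups(10, 4))
    print(run_x_prime(10, 4, sequential_fill))
```

Because the groups are handled strictly one after another, the total work is the sum of the per-group work, which is what the proof of Theorem 4.6 relies on.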

Remark: Strictly speaking, it is not necessary to modify algorithm X for $P < N$ processors.
Algorithm X can be used with $P < N$ processors by initially assigning the $P$ processors to the
first $P$ elements of the array to be visited. It can also be shown that X and X' have the same
asymptotic complexity; however, the analysis of X' is very simple, as we show below.


Theorem 4.6 Algorithm X' with $P$ processors solves the Write-All problem of size $N$ ($P \le N$)
using completed work $S = O(N \cdot P^{\log\frac{3}{2}})$. In addition, there is an adversary that forces algorithm
X' to perform $S = \Omega(N \cdot P^{\log\frac{3}{2}})$ work.

Proof: By Lemma 4.4, $S_{P,P} = O(P \cdot P^{\log\frac{3}{2}}) = O(P^{\log 3})$. Thus the overall work will be $S =
O(\frac{N}{P} S_{P,P}) = O(\frac{N}{P} P^{\log 3}) = O(N \cdot P^{\log\frac{3}{2}})$.

Using the strategy of Lemma 4.5, an adversary causes the algorithm to perform work $S_{P,P} =
\Omega(P^{\log 3})$ on each of the $\frac{N}{P}$ segments of the input array. This results in the overall work of:
$S = \Omega(\frac{N}{P} P^{\log 3}) = \Omega(N \cdot P^{\log\frac{3}{2}})$. □
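The only arithmetic used in passing from the per-group bound to the overall bound is the exponent identity $\log 3 - 1 = \log\frac{3}{2}$. Written out, assuming base-2 logarithms as in the progress-tree height:

```latex
\frac{N}{P}\cdot P^{\log 3}
  \;=\; N \cdot P^{\log 3 - 1}
  \;=\; N \cdot P^{\log 3 - \log 2}
  \;=\; N \cdot P^{\log\frac{3}{2}} .
```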

Remark: Lemma 4.3 gives only a loose upper bound for the worst-case time performance of algorithm
X; there we are primarily concerned with termination. The actual worst-case time for algorithm
X can be no more than the upper bound on the completed work. This is because at any point in
time there is at least one update cycle that will complete. Therefore, for algorithm X' with $P \le N$,
the time is bounded by $O(N \cdot P^{\log\frac{3}{2}})$. In particular, for $P = N$, the time is bounded by $O(N^{\log 3})$.
In fact, using the worst-case strategy of Lemma 4.5, an adversary can "time share" the completed
cycles of the processors so only one processor is active at any given time, with the processor with
PID 0 being one step ahead of other processors. The resulting time is then $\Omega(N^{\log 3})$.

In algorithm X, processors work for the most part independently of other processors; they
attempt to avoid duplicating already-completed work but do not co-ordinate their actions with
other processors. It is this property which allows the algorithm to run on the strongly asynchronous
model with the same work and time bounds.

Lemma 4.7 Algorithm X with $P$ processors solves the Write-All problem of size $N$ ($P \ge N$) on
the strongly asynchronous model with total work $O(P \cdot N^{\log\frac{3}{2}})$.

Proof: If we let $S_{N,P}$ be the total work done by algorithm X on a problem of size $N$ with $P$
processors, then $S_{N,P}$ satisfies the same recurrence as given in the proof of Lemma 4.4. The proof,
which never uses synchrony, goes through exactly as in that lemma, except that case 1 (where
left and right subtrees have their roots marked simultaneously) does not occur. □

The  final  result  of  this  section  is  similar  to  Theorem  4.6: 

Theorem 4.8 Algorithm X' with $P$ processors solves the Write-All problem of size $N$ ($P \le N$)
on the strongly asynchronous model with total work $O(N \cdot P^{\log\frac{3}{2}})$.

4.3 Algorithm T: a three-processor algorithm

Quite different techniques are necessary when designing a parallel algorithm in which the number
of processors is much smaller than the size of the input. The goal in this situation, when the
underlying machine is synchronous, is to find a method whose parallel time complexity is at most
the sequential time complexity divided by the number of processors plus a small additive overhead;
see [And 90] for an example of such an algorithm. Note that constant factors are important and
cannot be hidden in O-notation. When considering algorithms on fail-stop or asynchronous models,
the goal is to have the parallel work complexity be equal to the sequential complexity plus small
overhead.

For the Write-All problem, it is easy to achieve this goal with two processors. The processor
with PID 0 (henceforth, $P_0$) reads and then writes locations sequentially starting at 1 and moving
up; processor $P_1$ reads and then writes locations sequentially starting at $N$ and moving down. Both
processors stop when they read a 1. The completed work is exactly $N + 1$.

The first non-trivial case is that of three processors. Here is an intuitive description of an
algorithm that works in this situation. Processor $P_0$ works left-to-right, processor $P_1$ works
right-to-left, and $P_2$ fills starting from the middle and alternately expanding in both directions. If $P_0$
and $P_2$ meet, they both know that an entire prefix of the memory cells has been written. Processor
$P_0$ then jumps to the leftmost cell not written by itself or $P_2$, and $P_2$ jumps to the new "middle"
of unwritten cells. A meeting of $P_1$ and $P_2$ is symmetric. When $P_0$ and $P_1$ meet, the computation
is complete. Intuitively, processors can maintain an upper bound on the number of empty cells
remaining that starts at $N$ and is halved every time a collision occurs. Thus at most $\log N$ collisions
are experienced by each processor. High-level pseudo-code for the algorithm is given in Figure 4.

Implementation of the high-level algorithm requires some form of communication among the
asynchronous processors. At a collision, a processor must determine which processor previously
wrote the cell. In the case of a collision with $P_2$, a processor must also determine what portion
of the array to jump over. This communication may be implemented either by writing additional
information to the cells of the array or by using auxiliary variables.

If the array in which processors are writing is also used to hold auxiliary information,
implementation is straightforward. When processor $P_2$ writes to a cell at the left (resp. right) end of its
area, it writes the location of the next unwritten cell to the right (resp. left). $P_0$ and $P_1$ write the
values $-1$ and $N + 1$ respectively, to signal no unwritten cells. A total of $N + O(\log N)$ reads and
$N + O(\log N)$ writes are required on the asynchronous model. If an atomic compare-and-swap is
used, the total work is reduced to $N + O(\log N)$ operations.

To solve the pure Write-All problem, in which only 1's are written to the array, auxiliary shared
variables are required. These variables must be carefully managed to ensure that the processors
maintain a consistent view of the progress of the algorithm. Because a processor may be delayed
between reading an auxiliary variable and writing to the array, complete consistency is impossible.
Approximate consistency is sufficient, however, if the processors are appropriately pessimistic. The
precise code is presented and analyzed in Appendix B.

In  summary,  algorithm  T  provides  the  following  bounds. 

Theorem 4.9 The Write-All problem for three processors can be solved with $N + O(\log N)$ writes
to and $N + O(\log N)$ reads from the array.


Figure 4: A high-level description of algorithm T. Processor $P_i$ executes $T_i$.


In most applications, the array also has room for communication variables, and no auxiliary
variables are necessary.

5  General  simulations  on  restartable  fail-stop  processors 

We now present a major extension to the algorithms presented so far. This is an efficient
deterministic simulation of any $N$-processor synchronous PRAM on $P$ restartable fail-stop processors
($P \le N$). Note that due to the impossibility results for asynchronous models [Her 88], we are able
to show this result only for the restartable fail-stop model.

We first formally state the main result and then discuss its proof.

Theorem 5.1 Any $N$-processor PRAM algorithm can be executed on a restartable fail-stop
$P$-processor CRCW PRAM, with $P \le N$. Each $N$-processor PRAM step is executed in the presence
of any pattern $F$ of failures and restarts of size $M$ with:

• completed work: $S = O(\min\{N + P\log^2 N + M\log N,\ N \cdot P^{\log\frac{3}{2}}\})$,

• overhead ratio: $\sigma = O(\log^2 N)$.

EREW, CREW, and WEAK and COMMON CRCW PRAM algorithms are simulated on fail-stop
COMMON CRCW PRAMs; ARBITRARY and STRONG CRCW PRAMs are simulated on fail-stop
CRCW PRAMs of the same type. □

Remark:  Priority  CRCW  PRAMs  cannot  be  directly  simulated  using  the  same  framework, 
for  one  of  the  algorithms  used  (namely  algorithm  X  in  Section  4)  does  not  possess  the  processor 
allocation  monotonicity  property  that  assures  that  higher  numbered  processors  simulate  the  steps 
of  the  higher  numbered  original  processors. 

An approach for executing arbitrary PRAM programs on fail-stop CRCW PRAMs (without
restart) was presented independently in [KPS 90] and [Shv 89]. The execution is based on
simulating individual PRAM computation steps using the Write-All paradigm. It was shown that the
complexity of solving an $N$-size instance of the Write-All problem using $P$ fail-stop processors is
equal to the complexity of executing a single $N$-processor PRAM step on a fail-stop $P$-processor
PRAM. Here we describe how algorithms V and X' are combined with the framework of [KPS 90] or
[Shv 89] to yield efficient executions of PRAM programs on PRAMs that are subject to stop-failures
and restarts as stated in Theorem 5.1.

Theorem 5.2 There exists a Write-All solution using $P \le N$ processors on instances of size
$N$ such that for any pattern $F$ of failures and restarts with $|F| \le M$, the completed work is
$S = O(\min\{N + P\log^2 N + M\log N,\ N \cdot P^{\log\frac{3}{2}}\})$, and the overhead ratio is $\sigma = O(\log^2 N)$.

Proof: The executions of algorithms V and X' can be interleaved to yield an algorithm that
achieves the performance as stated. The completed work complexity is asymptotically equal to
the minimum of the completed work performed by V and X'. This is because the number of
cycles performed by each algorithm in the interleaving differs by at most a multiplicative constant.
The overhead ratio is directly inherited from algorithm V by the same reasoning because of the
Definition 2.3 of $\sigma$ and $S$. □
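The interleaving argument can be illustrated with a small Python sketch. It is a toy model under explicit assumptions: each algorithm is represented by a generator that yields once per completed cycle, the two generators are independent stand-ins (not algorithms V and X' themselves), and `interleave` and `counting_algorithm` are hypothetical names. The point is only that alternating cycles and stopping when either side finishes costs at most about twice the smaller of the two cycle counts.

```python
def counting_algorithm(total_cycles):
    """Stand-in for a Write-All algorithm that needs total_cycles update cycles."""
    for _ in range(total_cycles):
        yield  # one completed update cycle


def interleave(alg_a, alg_b):
    """Alternate one cycle of each algorithm; stop as soon as either finishes.

    Returns (name of the algorithm that finished first, cycles executed).
    """
    cycles = 0
    runners = (("A", alg_a), ("B", alg_b))
    while True:
        for name, alg in runners:
            try:
                next(alg)
            except StopIteration:
                return name, cycles
            cycles += 1


if __name__ == "__main__":
    print(interleave(counting_algorithm(5), counting_algorithm(100)))
    print(interleave(counting_algorithm(100), counting_algorithm(7)))
```

In this model the combined cycle count never exceeds $2\min(a, b) + 1$ when the two algorithms alone need $a$ and $b$ cycles, which is the "multiplicative constant" in the proof.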

The simulations of the individual PRAM steps are based on replacing the trivial array
assignments in a Write-All solution with the appropriate components of the PRAM steps. These steps are
decomposed into a fixed number of assignments corresponding to the standard fetch/decode/execute
RAM instruction cycles in which the data words are moved between the shared memory and the
internal processor registers. The resulting algorithm is then used to interpret the individual cycles
using the available fail-stop processors and to ensure that the results of computations are stored
in temporary memory before simulating the synchronous updates of the shared memory with the
new values. For the details on this technique, the reader is referred to [KS 89, KPS 90, Shv 89].
Application of these techniques in conjunction with the algorithms V and X' yields efficient and
terminating executions of any non-fault-tolerant PRAM programs in the presence of arbitrary
failure and restart patterns. Theorem 5.1 follows from Theorem 5.2 and the results of [KPS 90] or
[Shv 89]. The following corollaries are also interesting:

Corollary 5.3 Under the hypothesis of Theorem 5.1, and if $|F| \le P \le N$, then:

$S = O(N + P\log^2 N)$, and $\sigma = O(\log^2 N)$.

The fail-stop (without restarts) behavior of the combined algorithm is subsumed by this
corollary. The exact analysis of algorithm V without restarts is still unknown. Without restarts,
[KPRS 90] have an algorithm with $S = O(N + P\frac{\log^2 N}{\log\log N})$; [Mar 91] has shown that the same
performance is achieved by algorithm W from [KS 89].

Corollary 5.4 Under the hypothesis of Theorem 5.1:

• when $|F|$ is $\Omega(N\log N)$, then $\sigma$ is $O(\log N)$,

• when $|F|$ is $\Omega(N^{1.59})$, then $\sigma$ is $O(1)$.

Thus the overhead efficiency of our algorithm actually improves for large failure patterns. These
results also suggest that it is harder to deal efficiently with a few worst-case failures than with a
large number of failures.
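A sketch of why both cases of Corollary 5.4 hold is given below. It assumes that the overhead ratio amortizes completed work over the input size plus the number of failures and restarts, i.e. $\sigma \le S/(N + |F|)$ for the patterns considered, which is how the measure is described in the abstract (Definition 2.3 itself lies outside this section), and it uses $P \le N$ and $M = |F|$.

```latex
\begin{align*}
|F| = \Omega(N\log N):\quad
\sigma &\le \frac{O(N + P\log^2 N + M\log N)}{N + M}
        = O\!\left(1 + \frac{N\log^2 N}{M} + \log N\right)
        = O(\log N), \\[4pt]
|F| = \Omega(N^{1.59}):\quad
\sigma &\le \frac{O(N \cdot P^{\log\frac{3}{2}})}{N + M}
        \le \frac{O(N^{\log 3})}{M}
        = O(1), \qquad \text{since } \log 3 < 1.59 .
\end{align*}
```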

Our next corollary demonstrates a non-trivial range of parameters for which the completed
work is optimal; i.e., the work performed in executing a parallel algorithm on a faulty PRAM is
asymptotically equal to the Parallel-time × Processors product for that algorithm.

Corollary 5.5 Any $N$-processor, $\tau$-time PRAM algorithm can be executed on a $P \le N/\log^2 N$
processor fail-stop CRCW PRAM, such that when during the execution of each $N$-processor step
of that algorithm the total number of processor failures and restarts is $O(N/\log N)$, then the
completed work is $S = O(\tau \cdot N)$.


6 Discussion and Open Problems

We conclude with a brief discussion of open problems and the effects of on-line adversaries on the
expected performance of randomized algorithms.

Lower bounds: We have shown an $\Omega(N\log N)$ lower bound (when $N = P$) for the Write-All
problem in both the restartable fail-stop and the strongly asynchronous models under the
assumption that processors can read and locally process the entire shared memory at unit cost.
Under this assumption, these are the best possible lower bounds.

Under the same assumption, it can be shown that the $\Omega(N\log N/\log\log N)$ lower bound of
[KS 89] is the best possible bound for failures without restarts. This is done by adapting the
analysis of algorithm W by [Mar 91]. According to the analysis, the number of "block-steps" of W
for $P = N$ is $O(N\log N/\log\log N)$ and each block-step can be realized at unit cost, under
the above assumptions.

Under different assumptions, an $\Omega(N\log N)$ lower bound is shown for failures without restarts
in [KPRS 90].

Can these lower bounds be further improved? Can the lower bound of $N + \Omega(P\log N)$ be proved
for the restartable fail-stop model, or improved for the strongly asynchronous model with atomic
reads and writes?

Upper bounds: Is $O(N\log^{O(1)} N)$ completed/total work for solving Write-All with $N$ processors
and input of size $N$ achievable in the restartable fail-stop/strongly asynchronous model? Recently,
an existence proof for an algorithm achieving work was given in [AW 91].

What is the worst-case completed work $S$ and overhead ratio $\sigma$ of algorithm X in the
fail-stop (without restart) framework of [KS 89]? Algorithm X appears to perform well in this
context. For example, the adversary used to show the lower bound in [KS 89] causes completed
work $S = O(N\log^2 N/\log\log N)$ for the $N$-processor Write-All solution in [KS 89]. The same
adversary causes algorithm X to do completed work $S = O(N\log N\log\log N/\log\log\log N)$. We
conjecture that the fail-stop (no restart) performance of X has $S = O(N\log N\log\log N)$ using $N$
processors.

Can  algorithm  T  be  generalized  to  work  with  more  than  three  processors,  or  can  another  (more 
general)  algorithm  be  found  that  achieves  truly  optimal  speedup  for  small  numbers  of  processors? 

Model issues: What is the minimum number of reads and writes necessary in an update cycle to
ensure efficient algorithms? What is the precise relationship between the complexity of problems
(as opposed to algorithms) on the two models presented here? Finally, are there efficient algorithms
for important problems that do not come from simulation of synchronous PRAM algorithms?

On randomization and lower bounds: Analyses of randomized solutions for Write-All have so
far considered only off-line (non-adaptive) adversaries. In contrast, the lower bounds of Section 3
apply to both the worst-case performance of deterministic algorithms and the expected performance
of randomized algorithms subject to on-line adversaries.

A randomized asynchronous coupon clipping (ACC) algorithm for the Write-All problem was
analyzed in [MSP 90]. Assuming off-line adversaries, it was shown in [MSP 90] that the ACC algorithm
performs expected $O(N)$ work using $P = N/(\log N \log^* N)$ processors on inputs of size $N$.

In the on-line case, we observe that a simple stalking adversary causes the ACC algorithm to
perform (expected) work of $\Omega(N^2/\mathrm{polylog}\,N)$ in the case of fail-stop errors, and $\Omega(\cdots)$
work in the case of fail-stop errors with restart even when using $P \le \cdots$ processors. The
stalking adversary strategy consists of choosing a single leaf in a binary tree employed by ACC,
and failing all processors that touch that leaf until only one processor remains in the fail-stop case,
or until all processors simultaneously touch the leaf in the fail-stop/restart case. This performance
is not improved even when using the completed work accounting. On a positive note, when the
adversary is made off-line, the ACC algorithm becomes efficient in the fail-stop/restart setting.


forall processors PID = 0..P-1 parbegin
  shared x[1..N];                -- shared memory
  shared d[1..2N-1];             -- "done" heap (progress tree)
  shared w[0..P-1];              -- "where" array
  private done, where;           -- current node done/where
  private left, right;           -- left/right child values

  action, recovery
    w[PID] := 1 + PID;           -- the initial positions
  end;

  action, recovery
    while w[PID] ≠ 0 do          -- while haven't exited the tree
      where := w[PID];           -- current heap location
      done := d[where];          -- doneness of this subtree
      if done then w[PID] := where div 2;             -- move up one level
      elseif not done ∧ where > N-1 then              -- at a leaf
        if x[where-N] = 0 then x[where-N] := 1;       -- initialize leaf
        elseif x[where-N] = 1 then d[where] := 1;     -- indicate "done"
        fi
      elseif not done ∧ where ≤ N-1 then              -- interior tree node
        left := d[2*where]; right := d[2*where+1];    -- read left/right child values
        if left ∧ right then d[where] := 1;           -- both children done
        elseif not left ∧ right then w[PID] := 2*where;     -- go left
        elseif left ∧ not right then w[PID] := 2*where+1;   -- go right
        elseif not left ∧ not right then              -- both subtrees are not done
          -- move down according to the PID bit
          if not PID[log(where)] then w[PID] := 2*where;    -- move left
          elseif PID[log(where)] then w[PID] := 2*where+1;  -- move right
          fi
        fi
      fi
    od
  end
parend.


Figure  5:  Algorithm  X. 


A  Algorithm  X  pseudocode 

Here we give detailed pseudocode for algorithm X on the restartable fail-stop model.

In the pseudocode, the action, recovery ... end construct of [SS 83] is used to denote the actions
and the recovery procedures for the processors. In the algorithm this signifies that an action is also
its own recovery action, should a processor fail at any point within the action block.

The notation "PID[log(k)]" is used to denote the binary true/false value of the $\lfloor\log(k)\rfloor$-th bit
of the $\log(N)$-bit representation of PID, where the most significant bit is the bit number 0, and the
least significant bit is bit number $\log N$. Finally, div stands for integer division with truncation.

The action/recovery construct can be implemented by appropriately checkpointing the instruction
counter in stable storage as the last instruction of an action, and reading the instruction
counter upon a restart. This is amenable to automatic implementation by a compiler.

T0:
  shared I0 := 1;
  shared I1;
  private temp0;
  shared x[1..N];
  repeat
    -- Invariant: x[k] = 1 for all k < I0
    if x[I0] = 0 then
      x[I0] := 1;
      I0 := I0 + 1;
    elseif I0 > I1 or I0 ≥ Right2 then
      -- Collision with P1
      I0 := N + 1;
    else
      -- Collision with P2
      temp0 := Mid2;
      if I0 < Left2 then
        -- Left2 has been updated
        I0 := Left2
      else
        -- The correct Mid2 was read
        I0 := max{2 * temp0 - I0, I0 + 1}
      fi
    fi
  until I0 ≥ N + 1;

T1:
  shared I1 := N;
  shared I0;
  private temp1;
  shared x[1..N];
  repeat
    -- Invariant: x[k] = 1 for all k > I1
    if x[I1] = 0 then
      x[I1] := 1;
      I1 := I1 - 1;
    elseif I1 < I0 or I1 ≤ Left2 then
      -- Collision with P0
      I1 := 0;
    else
      -- Collision with P2
      temp1 := Mid2;
      if I1 > Right2 then
        -- Right2 has been updated
        I1 := Right2
      else
        -- The correct Mid2 was read
        I1 := min{2 * temp1 - I1, I1 - 1}
      fi
    fi
  until I1 ≤ 0;

Figure 6: Algorithms T0 and T1


It  is  possible  to  perforna  local  optimization  of  the  algorithm  by:  (i)  evenly  spacing  the  P 
processors  N/P  leaves  apart  by  when  P  <  N,  and  by  (ii)  using  the  integer  values  at  the  progress 
t  rop  nodes  to  represent  the  known  number  of  descendent  leaves  visited  by  the  algorithm.  Our  worst 
case  analysis  does  not  benefit  from  these  nVodifications. 

The algorithm can be used to solve Write-All "in place" using the array x[] as a tree of height
log(N/2) with the leaves x[N/2..N-1], and doubling up the processors at the leaves, and using x[N]
as the final element to be initialized and used as the algorithm termination sentinel. With this
modification, array d[] is not needed. The asymptotic efficiency of the algorithm is not affected.
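The paper does not spell out the index arithmetic for the in-place tree. One conventional layout that is consistent with leaves at x[N/2..N-1] is the heap ordering sketched below, which assumes that N is a power of two, that the root is stored at x[1], and that the children of node v are at 2v and 2v + 1. All function names are hypothetical.

```python
def is_leaf(v, n):
    """Leaves occupy x[n/2 .. n-1] in the assumed heap layout."""
    return n // 2 <= v <= n - 1


def parent(v):
    """Index of the parent of node v (the root, index 1, has no parent)."""
    return v // 2


def children(v):
    """Indices of the left and right children of an internal node v."""
    return 2 * v, 2 * v + 1


def height(n):
    """Tree height log2(n/2), for n a power of two with n >= 2."""
    return (n // 2).bit_length() - 1


def path_to_root(v):
    """Nodes visited when walking from v up to the root at index 1."""
    path = [v]
    while v > 1:
        v = parent(v)
        path.append(v)
    return path


n = 16
leaves = [v for v in range(1, n) if is_leaf(v, n)]
path = path_to_root(13)
```

In this layout x[N] lies outside the tree, which matches its use as the final element to be written.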

B  Algorithm  T  pseudocode 

The code for algorithm T in Figures 6 and 7 is given in three parts, one for each of the three
processors (algorithm T_i for processor P_i). The code given is designed for easy proof of correctness,
rather than optimality.
T2:

shared Left2 := 1;          -- left boundary of current write area
shared Right2 := N;         -- right boundary of current write area
shared Mid2 := ⌈N/2⌉;       -- middle of current write area
shared I0, I1;
shared x[1..N];
private i := 0;             -- number of writes in current area

repeat
    -- Invariant: At all times, x[k] = 1 for all values of k that satisfy
    -- 1 ≤ k < Left2 or Mid2 - i < k < Mid2 + i or Right2 < k ≤ N
    case (x[Mid2 - i], x[Mid2 + i]) is
        (0,0):  -- continue writing in current area
            x[Mid2 - i] := 1;
            x[Mid2 + i] := 1;
            i := i + 1;
        (1,0):  -- jump to the right
            jumpright;
        (0,1):  -- jump to the left
            jumpleft;
        (1,1):
            i := i + 1;
            if I0 > Mid2 then jumpright else jumpleft fi
    esac
until Left2 > Right2 or Mid2 - i < Left2

procedure jumpright:
    Left2 := Mid2 + i;
    i := 0;
    Mid2 := ⌈(Left2 + Right2)/2⌉;
end

procedure jumpleft:
    Right2 := Mid2 - i;
    i := 0;
    Mid2 := ⌈(Left2 + Right2)/2⌉;
end

Figure 7: Algorithm T2


T0 and T1 terminate because I0 increases and I1 decreases with every loop iteration. T2
terminates because every loop iteration either increases i or decreases Right2 - Left2. Since any
execution of algorithm T is equivalent to some serialized execution, the following lemma implies
that all cells of the array x are 1 at termination.

Lemma B.1 Every serialized execution of algorithm T maintains the following invariants.

1. For all k such that 1 ≤ k < I0, cell k contains 1.

2. For all k such that I1 < k ≤ N, cell k contains 1.

3. For all k such that 1 ≤ k < Left2, cell k contains 1.

4. For all k such that Right2 < k ≤ N, cell k contains 1.

5. For all k such that Mid2 - i < k < Mid2 + i, cell k contains 1.

If some cell k has value 1, then at least one of the following holds.

6. Cell k was written by P0 at a time when I0 had the value k, or

7. Cell k was written by P1 at a time when I1 had the value k, or

8. Cell k was written by P2 at a time when the values of Mid2 and i satisfied k = Mid2 ± i.

Proof: Inspection of the code reveals that the consecutive values of I0 and of Left2 are
nondecreasing, and the values of I1 and of Right2 are nonincreasing. Also, no processor writes to the
same cell twice, and 0 is never written.

The invariants are vacuous at the start of the algorithm. It is necessary and sufficient to show
that every operation preserves the invariants. The last three are trivial.

The assignments I0 := I0 + 1, I0 := N + 1 and I0 := Left2 preserve the invariants because
of the comparisons preceding their execution and the monotonicity properties. The assignment
I0 := 2 * temp0 - I0 is executed only after cell I0 has been found to have been written by P2 only.
The variable temp0 holds a value of Mid2 that was valid at some time after the write and before
Left2 was increased by a subsequent execution of procedure jumpright. If P2 had not yet jumped,
conditions 8 and 5 imply the preservation of condition 1. Otherwise, P2 jumped to the left because
of a collision with P1, and the entire array has been written, satisfying all of the invariants.

The case of assignments to I1 is symmetrical.

The assignment Left2 := Mid2 + i is executed only after P0 has written to cell Mid2 - i,
and hence conditions 1, 5 and 6 imply preservation of condition 3. Similarly, Right2 := Mid2 - i
is executed only after P1 has written to cell Mid2 + i, and hence conditions 2, 5 and 7 imply
preservation of condition 4. □
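The following Python sketch exercises the pseudocode of Figures 6 and 7 under randomly chosen serializations. It rests on several assumptions that are not part of the paper: each loop iteration is treated as one atomic action (a coarser serialization than the model allows), the schedule is drawn from a seeded random generator rather than an adversary, Mid2 is computed with a ceiling, and an extra cell x[N+1] preset to 1 guards the probe of cell Right2 + 1 by T2. The name simulate_T and its helpers are hypothetical.

```python
import random


def simulate_T(n, seed=0):
    """Run T0, T1 and T2 on x[1..n] under a random serialization.

    Returns (cells, writes): the final contents of x[1..n] and the number
    of write operations to x performed by all three processors.
    """
    rng = random.Random(seed)
    x = [0] * (n + 2)
    x[n + 1] = 1          # assumed sentinel: T2 may probe cell Right2 + 1
    s = {"I0": 1, "I1": n, "Left2": 1, "Right2": n,
         "Mid2": (n + 1) // 2, "i": 0, "writes": 0}

    def write(k):
        x[k] = 1
        s["writes"] += 1

    def jump(right):
        # jumpright raises Left2; jumpleft lowers Right2.
        if right:
            s["Left2"] = s["Mid2"] + s["i"]
        else:
            s["Right2"] = s["Mid2"] - s["i"]
        s["i"] = 0
        s["Mid2"] = (s["Left2"] + s["Right2"] + 1) // 2   # ceiling of the mean

    def step0():
        i0 = s["I0"]
        if x[i0] == 0:
            write(i0)
            s["I0"] = i0 + 1
        elif i0 > s["I1"] or i0 >= s["Right2"]:    # collision with P1
            s["I0"] = n + 1
        elif i0 < s["Left2"]:                      # Left2 has been updated
            s["I0"] = s["Left2"]
        else:                                      # skip to the mirror cell
            s["I0"] = max(2 * s["Mid2"] - i0, i0 + 1)
        return s["I0"] >= n + 1

    def step1():
        i1 = s["I1"]
        if x[i1] == 0:
            write(i1)
            s["I1"] = i1 - 1
        elif i1 < s["I0"] or i1 <= s["Left2"]:     # collision with P0
            s["I1"] = 0
        elif i1 > s["Right2"]:                     # Right2 has been updated
            s["I1"] = s["Right2"]
        else:                                      # skip to the mirror cell
            s["I1"] = min(2 * s["Mid2"] - i1, i1 - 1)
        return s["I1"] <= 0

    def step2():
        m, i = s["Mid2"], s["i"]
        probe = (x[m - i], x[m + i])
        if probe == (0, 0):
            write(m - i)
            write(m + i)
            s["i"] = i + 1
        elif probe == (1, 0):
            jump(right=True)
        elif probe == (0, 1):
            jump(right=False)
        else:
            s["i"] = i + 1
            jump(right=s["I0"] > m)
        return s["Left2"] > s["Right2"] or s["Mid2"] - s["i"] < s["Left2"]

    steps = [step0, step1, step2]
    active = [0, 1, 2]
    while active:
        p = rng.choice(active)     # random choice stands in for the adversary
        if steps[p]():             # True means the until-condition holds
            active.remove(p)
    return x[1:n + 1], s["writes"]
```

A call such as simulate_T(100, seed=1) returns the final array and the write count. Under the assumptions above every returned cell should be 1, which is what invariant 1 gives once I0 reaches N + 1.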

To prove the desired work bound, we use the following definition of a collision between
processors.

Definition B.1 P0 collides with P_j (j ∈ {1,2}) if P0 executes the code block labelled "collision
with P_j." P1 collides with P_j (j ∈ {0,2}) if P1 executes the code block labelled "collision with
P_j." P2 collides with P0 if P2 executes procedure jumpright. P2 collides with P1 if P2 executes
procedure jumpleft.

A redundant write does not imply that the writing processors collide with one another.
Nevertheless, the number of collisions is a bound on the number of redundant writes.

Lemma  B.2  Suppose  two  processors  both  write  to  cell  k.  Then  one  (or  both)  of  the  processors 
will  collide  in  its  next  loop  iteration. 
Proof: One of the two processors must be P0 or P1. If it is P0, then the other will next attempt
to write to cell k - 1 and collide. If it is P1, then the other will next attempt to write to cell k + 1
and collide. (In either case, the collision may involve the third processor.) □


Lemma B.3 There are O(log N) collisions.

Proof: When P2 jumps, the quantity Right2 - Left2 decreases by a factor of at least 2. Hence P2
collides at most log N times. Also, P0 can collide with P1, and P1 with P0, at most once each.

Suppose P0 collides with P2 in attempting to write to cell k. Because P0 did not collide with
P1, P2 wrote to cell k with some value m in Mid2 and the value m - k in i. If P2 continues to
process, it will collide with either P0 or P1 after at most two iterations, when the value of i has
become m - k + 2. (The worst case occurs if P0 and P2 both write cell k - 1.) Hence the only cells
that P2 writes with m in Mid2 are in the interval [k - 1, 2m - k + 1]. Thus P0 attempts to write
at most four cells in the interval (i.e., cells k - 1, k, 2m - k and 2m - k + 1), and can collide only
at the latter three. Therefore, the number of collisions of P0 with P2 is at most three times the
number of collisions of P2.

Similarly, the number of collisions of P1 with P2 is at most three times the number of collisions
of P2. Hence the total number of collisions is O(log N), as required. □
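Collecting the bounds stated in the proof gives an explicit constant. The symbols c_{ij}, denoting the number of collisions of P_i with P_j, are notation introduced here for the tally and do not appear in the paper.

```latex
\begin{align*}
\text{collisions of } P_2 &\le \log N,\\
c_{01} + c_{10} &\le 2,\\
c_{02} &\le 3\log N, \qquad c_{12} \le 3\log N,\\
\text{total} &\le \log N + 2 + 3\log N + 3\log N \;=\; 7\log N + 2 \;=\; O(\log N).
\end{align*}
```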

Each  collision  involves  only  a  constant  number  of  memory  accesses.  Thus  the  algorithm  satisfies 
the  required  work  bounds. 

Theorem B.4 Algorithm T solves the Write-All problem for three processors using N + O(log N)
writes to and N + O(log N) reads from the array. There are at most N + O(log N) writes and
O(log N) reads involving auxiliary variables.

Proof:  The  result  follows  directly  from  the  above  discussion.  □ 

If the cells of array x can hold arbitrary integer values, then the information communicated by
the values of the shared auxiliary variables can be stored directly in the array. Processors P0 and
P1 write -1 and -2 respectively. Processor P2 writes the value Mid2 + i when writing to the left
of Mid2 and the value Mid2 - i when writing to the right of Mid2. In this case, only private local
variables are required.
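The encoding can be illustrated with a short Python sketch. It assumes that 0 denotes an unwritten cell and that cell indices start at 1, so the values written by P2 are positive and cannot be confused with -1, -2 or 0. The function names are hypothetical, and the sketch shows only how a reader could recover Mid2 and i from a cell, not how the three processors would use that information.

```python
UNWRITTEN, BY_P0, BY_P1 = 0, -1, -2


def p2_value(mid2, i, side):
    """Value P2 stores in cell mid2 - i (side='left') or mid2 + i (side='right')."""
    return mid2 + i if side == "left" else mid2 - i


def decode(k, v):
    """Interpret the value v found in cell k."""
    if v == UNWRITTEN:
        return ("unwritten",)
    if v == BY_P0:
        return ("P0",)
    if v == BY_P1:
        return ("P1",)
    # For a cell written by P2, v is the index of the mirror cell:
    # {k, v} = {mid2 - i, mid2 + i}, so mid2 and i follow by arithmetic.
    mid2 = (k + v) // 2
    i = abs(v - k) // 2
    return ("P2", mid2, i)


left = decode(3, p2_value(5, 2, "left"))     # cell 3 = 5 - 2 holds 7
right = decode(7, p2_value(5, 2, "right"))   # cell 7 = 5 + 2 holds 3
```

Both calls recover the same pair (Mid2, i) = (5, 2), so a processor that meets either cell learns the extent of the region P2 had covered when it wrote that cell.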


